INTRODUCTION:

In  our  ongoing  project  entitled  “A  Search  for  Gene  Fusions/Translocations  in  Breast  Cancer”  we 
have  undertaken  a  systematic  evaluation  of  breast  cancer  to  map  disease-specific,  recurrent 
chromosomal  or  transcriptional  chimeras  in  breast  cancer  that  can  be  further  characterized  to 
develop novel biomarkers and therapeutic targets. Earlier, we reported the characterization of a
subset of ER-positive breast cancer patients, characterized by the overexpression of AGTR1, who
may be responsive to an available drug, losartan (Rhodes et al, 2009). We also provided a novel
mechanistic framework for the overexpression of the polycomb group protein EZH2 in
metastatic breast and prostate cancers, involving the genomic loss of its negative regulator,
miR-101 (Varambally et al, 2008). Additionally, we had reported a high-throughput sequencing
pipeline for a directed search for gene fusions in cancers using next generation transcriptome
sequencing platforms (Maher et al, 2009). From those efforts, we had identified numerous gene
fusions (70 in over 40 cancer samples) that mapped to loci of genomic amplifications. We
shortlisted several fusion candidates that involved kinase genes and other genes of interest
related to oncogenesis for further study.

Previously  we  described  the  exciting  discovery  and  characterization  of  two  novel  recurrent  and 
actionable  gene  fusions  in  our  breast  cancer  cohort  involving  MAST  and  Notch  genes.  Both 
MAST  and  Notch  family  gene  fusions  exerted  significant  phenotypic  effects  in  breast  epithelial 
cells (Robinson et al, 2011). We also reported the development of a novel bioinformatics tool
designed to facilitate the discovery of gene fusions from next-generation sequencing data
(Iyer et al, 2011), as well as a study that furthers our understanding of the role of microRNAs
in cancer progression (Cao et al, 2011).

In this reporting period, we extended our analysis of gene fusions in breast cancer and carried
out a novel study of cancer-specific pseudogenes.

BODY: 

A  detailed,  itemized  report  of  the  progress  in  work  follows: 

1. Characterization of recurrent gene fusions in breast cancer:

Application  of  high-throughput  transcriptome  sequencing  has  spurred  highly  sensitive 
detection  and  discovery  of  gene  fusions  in  cancer,  but  distinguishing  potentially  oncogenic 
fusions  from  random,  “passenger”  aberrations  has  proven  challenging.  We  examined  a 
distinctive  group  of  gene  fusions  that  involve  genes  present  in  the  loci  of  chromosomal 
amplifications — a  class  of  oncogenic  aberrations  that  are  widely  prevalent  in  breast  cancers. 
Integrative  analysis  of  a  panel  of  14  breast  cancer  cell  lines  comparing  gene  fusions 
discovered  by  high-throughput  transcriptome  sequencing  and  genome-wide  copy  number 
aberrations  assessed  by  array  comparative  genomic  hybridization  led  to  the  identification  of 
77 gene fusions, of which more than 60% were localized to amplicons including 17q12,
17q23, 20q13, and chr8q, among others. Many of these fusions appeared to be recurrent or




Chinnaiyan,  Arul  M. 
DOD Era of Hope (W81XWH-08-1-0110)

involved  highly  expressed  oncogenic  drivers,  frequently  fused  with  multiple  different 
partners,  but  sometimes  displaying  loss  of  functional  domains. 

Here  we  carried  out  a  systematic  analysis  of  the  association  between  gene  fusions  and 
genomic  amplification  by  integrating  RNA-Seq  data  with  array  comparative  genomic 
hybridization  (aCGH)-based  whole  genome  copy  number  profiling  from  a  panel  of  breast 
cancer  cell  lines.  We  examined  a  set  of  “amplicon-associated  gene  fusions”  that  refer  to  all  the 
fusions  where  one  or  both  gene  partners  are  localized  to  a  site  of  chromosomal  amplification. 
We  found  that  as  many  as  62%  of  the  total  number  of  fusions  were  associated  with  regions  of 
amplifications  (Figure  1). 
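
The definition of an amplicon-associated fusion given above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the study: the `Region` and `Fusion` types, the function names, and all coordinates are hypothetical, and it assumes that amplified regions have already been called from the aCGH data as simple genomic intervals.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Region:
    """A genomic interval with inclusive coordinates (hypothetical type)."""
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class Fusion:
    """A gene fusion described by the loci of its 5' and 3' partner genes."""
    name: str
    locus5: Region
    locus3: Region


def overlaps(a: Region, b: Region) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def is_amplicon_associated(fusion: Fusion, amplicons: Sequence[Region]) -> bool:
    """A fusion counts if one or both partner genes fall in an amplified region."""
    partners = (fusion.locus5, fusion.locus3)
    return any(overlaps(p, amp) for p in partners for amp in amplicons)


def amplicon_associated_fraction(fusions: Sequence[Fusion],
                                 amplicons: Sequence[Region]) -> float:
    """Fraction of fusions that are amplicon-associated (0.0 for an empty list)."""
    if not fusions:
        return 0.0
    hits = sum(is_amplicon_associated(f, amplicons) for f in fusions)
    return hits / len(fusions)


# Toy data: names and coordinates are made up for illustration only.
toy_amplicons = [Region("chr17", 1_000, 2_000)]
toy_fusions = [
    Fusion("A-B", Region("chr17", 1_100, 1_200), Region("chr17", 1_500, 1_600)),  # both partners amplified
    Fusion("C-D", Region("chr17", 1_900, 2_100), Region("chr8", 50, 80)),         # one partner amplified
    Fusion("E-F", Region("chr1", 10, 20), Region("chr2", 10, 20)),                # neither partner amplified
]
toy_fraction = amplicon_associated_fraction(toy_fusions, toy_amplicons)
```

Applied to real fusion calls and real amplicon boundaries, the same fraction would correspond to the proportion reported in Figure 1.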


Figure  1.  Distribution  of  gene  fusions  across  breast  cancer  cell  lines.  Pie  chart 
representation  of  the  relative  proportion  of  gene  fusions  associated  with  loci  of  genomic 
amplifications  compared  to  unamplified  loci  (left)  and  bar  graph  representation  of  the 
relative  distribution  of  gene  fusions  across  different  breast  cancer  cell  lines  (right). 
Neoplasia.  2012  Aug;14(8):702-8 


We next assessed the functional relevance of two amplicon-associated fusion genes involving
oncogenic kinases, EGFR and RPS6KB1, in the context of prioritizing fusion candidates
important in tumorigenesis. In our transcriptome sequencing compendium of 89 breast cancer
cell lines and tissues, the highest expression of EGFR is observed in MDA-MB-468,
potentially resulting from a focal amplification at chr7p12. In addition, we detected an EGFR
fusion transcript (EGFR-POLD1) in this cell line, encoding the N-terminal portion of EGFR,
completely devoid of the tyrosine kinase domain. Considering that MDA-MB-468 harbors both
MAST2 and EGFR fusions, we wanted to assess its relative “dependence” on both kinases.
Surprisingly, a profound reduction in cell proliferation was observed on siRNA knockdown of
MAST2, whereas EGFR knockdown showed little effect (Figure 2). Next, testing the possibility
of the EGFR amplicon potentially cooperating with MAST2, we found that the effect of

Figure 2. Proliferation assay showing absolute cell count (y axis) over a time course (x axis)
after knockdown with EGFR and/or MAST2 siRNAs in MDA-MB-468. QPCR assessment of
knockdown efficiencies relative to nontargeted control (NTC; inset).
Neoplasia. 2012 Aug;14(8):702-8



combined  knockdown  of  EGFR  and  MAST2  was  comparable  with  that  of  MAST2 
knockdown  alone  (Figure  2),  further  suggesting  that  EGFR  amplification  does  not  signify  a 
driver  aberration. 

Next, considering that BT-474 is an ERBB2-positive cell line, we tested potential dependence
of these cells on the RPS6KB1 protein. Surprisingly, similar to our observations with EGFR
knockdown in MDA-MB-468 cells, here we observed only a small effect on cell proliferation
after shRNA knockdown of RPS6KB1, in dramatic contrast to the effect of ERBB2 knockdown
(Figure 3). Notably, the shRNA knockdown of RPS6KB1 led to a significant depletion of the
full-length protein, yet it did not affect cell proliferation compared with ERBB2 protein
depletion (Figure 3, inset). Therefore, BT-474 cells do not display a dependence on RPS6KB1
protein, and considering that the RPS6KB1 fusion product is completely devoid of all
functional domains of RPS6KB1, including the kinase domain, this fusion also likely
represents a passenger event.

Overall,  our  study  suggests  that  amplicon-associated  gene  fusions  in  breast  cancer  primarily 
represent  a  by-product  of  chromosomal  amplifications  that  constitutes  a  subset  of  passenger 
aberrations  and  should  be  factored  accordingly  during  prioritization  of  gene  fusion  candidates. 
(Neoplasia.  2012  Aug;14(8):702-8). 

2.  Next  Generation  Sequencing  Analysis: 

Pseudogenes are a class of non-coding RNA transcripts that are dysfunctional relatives of
known functional genes; they have lost their protein-coding ability and are often not expressed.
Aberrant expression of several functional non-coding RNAs in cancer has been previously
described; however, genome-wide expression of pseudogenes had not been reported for any
cancer type. We developed a pseudogene expression pipeline to analyze a large compendium
of  paired-end  next  generation  sequencing  (RNASeq)  data  generated  from  293  samples, 
comprising  13  different  epithelial  cancers.  Our  integrative  approach  provided  evidence  of 
expression  for  2,082  distinct  pseudogenes  that  displayed  lineage-specific,  cancer-specific,  as 
well  as  ubiquitous  expression  patterns. 

Figure 3. Proliferation assay after shRNA knockdown of RPS6KB1 or ERBB2 in BT-474 cells.
Assessment of protein depletion after knockdown (inset). Neoplasia. 2012 Aug;14(8):702-8

Though a majority of the pseudogenes examined were found in both cancer and benign
samples, we observed 218 pseudogenes expressed only in cancer samples, of which 178 were
observed in multiple cancers and 40 were found to have highly specific expression in a single
cancer type only (Figure 4).
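
The grouping described above (expressed in both cancer and benign samples, in multiple cancers only, or in a single cancer type only) can be illustrated with a short Python sketch. It is a simplified assumption of how such a rule could be written, not the pipeline used in the study: the function name, category labels, and toy pseudogene names are hypothetical, and each pseudogene is assumed to come with a list of (tissue type, is-cancer) pairs for the samples in which it is detectably expressed.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_pseudogene(expressed_in: Iterable[Tuple[str, bool]]) -> str:
    """Assign an expression category from (tissue_type, is_cancer) sample pairs."""
    samples = list(expressed_in)
    if not samples:
        return "not_expressed"
    # Any benign sample with expression rules out a cancer-only pattern.
    if any(not is_cancer for _, is_cancer in samples):
        return "cancer_and_benign"
    tissues = {tissue for tissue, _ in samples}
    return "single_cancer" if len(tissues) == 1 else "multiple_cancers"


# Toy data: pseudogene names and sample labels are made up for illustration.
toy_expression = {
    "PSEUDO_1": [("breast", True), ("breast", True)],
    "PSEUDO_2": [("breast", True), ("prostate", True)],
    "PSEUDO_3": [("breast", True), ("breast", False)],
    "PSEUDO_4": [],
}
toy_counts = Counter(classify_pseudogene(s) for s in toy_expression.values())

# Consistency check on the counts reported in the text:
# 178 (multiple cancers) + 40 (single cancer type) = 218 cancer-only pseudogenes.
cancer_only_total = 178 + 40
```

A real analysis would also need an expression threshold and a rule for how many samples must show expression before a pseudogene counts as expressed in a tissue; both are left out of this sketch.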

Figure 4. Cancer-Specific Pseudogene Expression Profiles. (A) Heatmap of pseudogene
expression sorted according to cancer-specific expression patterns displays pseudogene
transcripts specific to individual cancers (top), common across multiple cancers
(tissue-enriched; middle), and nonspecific (bottom). (B) Zoomed-in version of the top panel
displaying individual cancer-specific expressed pseudogenes. The columns represent different
tissues with the number of samples in parentheses. The rows represent individual clusters
mapping to specific pseudogenes. The color intensity represents the frequency (%) of samples
in a tissue type showing expression of a given pseudogene (according to the scale indicated at
the bottom). Cell. 2012 Jun 22;149(7):1622-34.


Among the pseudogene candidates in breast cancer, we identified an unprocessed pseudogene
cognate to ATP8A2, a LIM domain-containing protein speculated to be associated with stress
response and proliferative activity. ATP8A2-Ψ expression was found to be restricted to breast
samples, with the highest levels seen in a subset of breast cancer tissues and cell lines
(Figure 4). By contrast, ATP8A2-WT expression was highly variable across different tissue
types and showed no correlation with ATP8A2-Ψ expression. To investigate a potential role of
ATP8A2-Ψ expression in breast cancer, we first carried out siRNA-based knockdown of both
the wild-type and pseudogene RNA in two independent breast cancer cell lines that expressed
both transcripts. Knockdown of ATP8A2-Ψ with two independent siRNAs specifically inhibited
the proliferation of the overexpressing cell lines Cama-1 and HCC1806 (Figure 5A), but not of
cell lines with no detectable levels of ATP8A2-Ψ, for example, the benign breast epithelial
cell line H16N2 (Figure 5A, right). Knockdown of ATP8A2-Ψ (but not ATP8A2-WT) also
reduced cell migration and invasion in in vitro Boyden chamber assays (Figure 5B), as well as
in vivo intravasation and metastasis in a chicken chorioallantoic membrane xenograft assay
(Figure 5C). In contrast, knockdown of wild-type ATP8A2 had no effect on the proliferation of
any of the cell lines tested, suggesting an unexpected growth regulatory role for ATP8A2-Ψ.

Figure 5. (A) Cell proliferation assays following siRNA knockdowns of ATP8A2-WT and
ATP8A2-Ψ as indicated. NTC, nontargeting control; WT, siRNA against wild-type ATP8A2;
Ψ, siRNA against ATP8A2-Ψ. (B) Boyden chamber assay showing cell migration (left) and
invasion through matrigel (right). (C) Chicken chorioallantoic membrane assay of HCC-1806
cells treated with nontargeting control siRNA, ATP8A2-WT, or ATP8A2-Ψ siRNA showing
relative number of cells intravasated in the lower CAM (left) and metastatic cells in chicken
lung (right). Error bars represent means ± SE of the mean. Cell. 2012 Jun 22;149(7):1622-34.


This study is the first large-scale analysis of pseudogene expression in human cancer using
transcriptome sequencing data. (Cell. 2012 Jun 22;149(7):1622-34).


KEY  RESEARCH  ACCOMPLISHMENTS: 

•  We  performed  an  integrated  analysis  combining  RNASeq  and  aCGH  to  examine 
amplicon-associated  gene  fusions  across  14  breast  cancer  cell  lines.  We  found  that 
many  of  these  fusions,  even  when  they  involve  known  oncogenes,  are  often 
“passenger”  events  that  do  not  display  oncogenic  potential. 

• We used a novel bioinformatics approach to analyze next generation sequencing data
to discover novel expressed pseudogenes. Although many of the pseudogenes are
ubiquitously expressed, we found that a subset of them are expressed in a lineage- and
cancer-specific manner, including the breast cancer-specific pseudogene, ATP8A2-Ψ.

REPORTABLE OUTCOMES:

Kalyana-Sundaram  S,  Kumar-Sinha  C,  Shankar  S,  Robinson  DR,  Wu  YM,  Cao  X,  Asangani  IA, 
Kothari  V,  Prensner  JR,  Lonigro  RJ,  Iyer  MK,  Barrette  T,  Shanmugam  A,  Dhanasekaran  SM, 
Palanisamy  N,  Chinnaiyan  AM.  Expressed  pseudogenes  in  the  transcriptional  landscape  of 
human cancers. Cell. 2012 Jun 22;149(7):1622-34. PubMed PMID: 22726445.

Kalyana-Sundaram S, Shankar S, Deroo S, Iyer MK, Palanisamy N, Chinnaiyan AM,
Kumar-Sinha C. Gene fusions associated with recurrent amplicons represent a class of passenger
aberrations in breast cancer. Neoplasia. 2012 Aug;14(8):702-8. PubMed PMID: 22952423;
PubMed Central PMCID: PMC3431177.

CONCLUSION: 

This past year we extended our previous studies that identified two rare but recurrent gene
fusions in breast cancer cell lines and tissues involving the potentially actionable MAST and
Notch genes. We analyzed amplicon-associated gene fusions across 14 cell lines and conclude
that many of them are likely passenger events that are not oncogenic drivers. In addition, we used
a novel bioinformatics approach to discover expressed pseudogenes. Of particular interest are
those that display cancer-specific expression, including in breast cancer, and confer cell
proliferative and metastatic properties. This represents another layer of complexity of cancer
biology that was previously unappreciated.

“So What?”: The tools that we develop to identify novel gene fusions and other drivers of
tumorigenesis, along with the biological functional analysis, lay the framework for developing
personalized breast cancer therapies based on driving mutations.

UPDATED EXISTING/PENDING SUPPORT STATEMENT (UEPS)


EFFORTS  IN  BREAST  CANCER  RESEARCH:  50%  Total 

ACTIVE 

W81XWH-08-1-0110 (PI: Chinnaiyan) 09/01/08 - 08/31/13 3.0 cal mos

Department  of  Defense  -  Era  of  Hope  $500,000/yr 

A Search for Gene Fusions/Translocations in Breast Cancer

Specific  Aims:  1)  develop  high-throughput  adaptations  of  existing  methodologies  such  as  fluorescence  in 
situ  hybridization  (FISH),  2)  employ  bioinformatics  and  associated  analytical  tools  to  elucidate  recurrent 
gene  fusions  in  breast  cancers,  3)  employ  next  generation  whole  transcriptome  sequencing  of  breast  tumors. 
Contact Information at funding agency: Grants Officer: Cheryl A. Lowery, 301-619-7150,
Cheryl.Lowery@us.army.mil, U.S. Army Medical Research Acquisition Activity, 820 Chandler Street
(MCMR-AAA-R), Fort Detrick, MD 21702-5014


Effort  to  breast  cancer:  25% 


W81XWH-12-1-0080  (PI:  Chinnaiyan)  09/15/12-09/14/17  1.2  cal  mos 

Department  of  Defense  $479,470/yr 

Advancing  our  understanding  of  the  etiologies  and  mutational  landscapes  of  basal-like,  luminal  A,  and 
luminal  B  breast  cancers 

Specific  Aims:  1)  Identify  and  quantify  risk  factors  for  each  of  the  most  common  molecular  subtypes  of 
breast  cancer,  basal-like,  luminal  A,  and  luminal  B  tumors,  in  a  large-scale  population-based  study.  2) 
Discover  and  validate  the  mutational  landscape  of  basal-like,  luminal  A,  and  luminal  B  tumors.  3) 
Characterize  the  relationships  between  subtype  specific  risk  factors  and  mutational  signatures.  4)  Develop 
and  validate  risk  prediction  models  unique  to  each  breast  cancer  subtype  incorporating  clinical, 
epidemiologic  and  mutation  data.  5)  Identify  and  quantify  the  relationships  between  various  exposures  and 
mutational  changes  on  risk  of  breast  cancer  recurrence  and  survival  among  patients  with  basal-like,  luminal 
A,  and  luminal  B  tumors. 

Contact Information at funding agency: Cheryl A. Lowery, U.S. Army Medical Research Acquisition
Activity, 820 Chandler Street (MCMR-AAA-R), Fort Detrick, MD 21702-5014, 301-619-7150,
Cheryl.Lowery@us.army.mil


Effort  to  breast  cancer:  10% 


PI:  Chinnaiyan  01/01/09  -  12/31/13  1.2  cal  mos 

Doris  Duke  Foundation  $275,000/yr 

Distinguished  Clinical  Scientist  Award  for  Excellence  in  "Bench  to  Bedside"  Research 

Goal(s): to launch a new effort in the laboratory to comprehensively and systematically scour common
human solid tumors for the presence of recurrent gene rearrangements. This effort primarily funds the
training of new translational researchers under the mentorship of Dr. Chinnaiyan.

Specific  Aims:  1)  Develop  and  employ  high-throughput  fluorescence  in  situ  hybridization  (FISH)  in  order  to 
interrogate  solid  tumors  for  recurrent  chromosomal  aberrations  including  gene  fusions  and  translocations;  2) 
Employ  bioinformatics  and  associated  analytical  tools  to  elucidate  recurrent  gene  fusions  in  common  solid 
tumors; 3) Employ next generation whole transcriptome and paired-end sequencing of common solid
tumors  to  identify  recurrent  gene  fusions  and  integrated  non-human  sequences  that  may  represent 
pathogens. 

Contact Information at funding agency: Grants Officer: Betsy Myers, emyers@ddcf.org, Doris Duke
Charitable Foundation, 650 5th Avenue, Fl 19, NY, NY; Phone: 212-974-7000

R01-HG-005119 (PI: Qin) 07/22/09 - 06/30/13 0.12 cal mos

NIH/NHGRI  $32,154/yr 

Model-Based  Methods  of  Analyzing  ChIP  Sequencing  Data 

Goal(s):  to  demonstrate  that  effective  data  integration  under  a  coherent  probability  framework  will  lead  to 
an  in-depth  understanding  of  mechanisms  mediating  transcription  regulation  in  cancer  progression. 

Role:  Co-Investigator 

Contact Information at funding agency: Teresa Sussman, Emory University, 1518 Clifton Road NE, 8th Fl,
Atlanta, GA 30322, tpoint@emory.edu; Phone 404-727-2503


Effort  to  breast  cancer:  10% 

Summary  of  results:  Tangible  progress  has  been  made  in  the  development  of  new  bioinformatics  tools  for 
recurrent  gene  fusion  discovery  in  solid  tumors.  Our  COPA  (Cancer  Outlier  Profile  Analysis)  approach  to 
gene  fusion  discovery  was  detailed  in  our  initial  proposal,  and  we  have  since  expanded  upon  this  concept 
with  applications  to  breast  cancer.  We  sought  to  identify  driving  gene  fusions  in  breast  cancer  using  a 
combination  of  a  meta-analysis  for  gene  expression  and  validation  across  31  gene  expression  studies  and 
approximately 3200 microarray experiments. The culmination of these efforts was reported with the
discovery that the gene AGTR1 (angiotensin II receptor type 1) was overexpressed in 10-20% of the 311
breast cancer cases tested by FISH (Rhodes, et al, 2009). The highest overexpression of AGTR1
occurred in estrogen responsive ERBB2-negative tumors. This is particularly exciting, in that AGTR1 is
already targeted by an existing, FDA-approved hypertension medication (Losartan).
In vivo studies performed in a mouse model show promising results for the possible use of this
agent in treating breast cancer patients with AGTR1-overexpressing breast tumors.

A  new,  sensitive,  high-throughput  analytic  method  was  developed  by  our  group  to  mine  for  functional  gene 
fusions  using  paired-end  transcriptome  sequencing  (Maher,  et  al.,  PNAS,  2009).  We  performed  paired-end 
analysis  on  the  MCF-7  breast  cancer  cell  line,  which  resulted  in  the  identification  of  an  additional  five  new 
mutations  in  breast  cancer.  We  identified  two  recurrent,  actionable  gene  fusions  in  a  subset  of  breast 
cancer  cohorts,  involving  the  MAST  and  Notch  genes.  Moreover,  breast  cancer  patients  harboring  MAST 
or Notch fusions could potentially be treated with their respective inhibitors.


2U01CA111275-06 (PI: Chinnaiyan) 08/01/10 - 06/30/15 1.2 cal mos

NIH/NCI  $370,000/yr 

EDRN:  A  Systems  Biology  Approach  to  the  Development  of  Cancer  Biomarkers 

Goal(s): Develop “omics”-based approaches to study solid tumors for the purpose of developing biomarkers.
Specific  aims:  Platform  technologies  cover  epigenetics,  genomics,  transcriptomics,  proteomics  and 
metabolomics. 

Contact Information at funding agency: Wendy Briscoe, briscoew@mail.nih.gov, Phone: 301-496-3160,
Fax: 301-496-8601, EPS- 6120 Executive Blvd, 243, Rockville, MD 20852


Effort  to  breast  cancer:  5% 

Summary of results: While this grant has been focused on prostate cancer, in general it is a
biomarker development lab and half of my effort can be designated to the development of breast
cancer biomarkers including AGTR1 in ER+, erbB2- patients.

AACR (Dream team leader: Chinnaiyan) 08/01/12 - 08/31/15 1.2 cal mos

Stand up to Cancer and Prostate Cancer Foundation Dream Team $440,021/yr

Precision  Therapy  of  Advanced  Prostate  Cancer 

Goal(s):  The  overall  goal  of  this  proposal  is  to  catalyze  the  interaction  of  a  multi-disciplinary  team  of 
investigators,  with  a  track  record  of  accomplishments  in  prostate  cancer  research,  to  work  together  on  the 
challenging  problem  of  metastatic  castration  resistant  prostate  cancer  (CRPC). 

Specific  Aims:  1)  Establish  a  multi-institutional  infrastructure  incorporating  5  leading  prostate  cancer 
clinical  sites,  2  sequencing  and  computational  analysis  sites,  linked  with  appropriate  sample  and  data 
coordination; 2) Establish a prospective cohort of 500 patients (the “CRPC 500”) utilizing the
multi-institutional infrastructure to support the clinical use of integrative prostate cancer sequencing, analysis, and
clinical  trial  decision  making;  3)  Conduct  parallel,  preclinical  in  vivo  functional  studies  of  resistance 
biomarkers  and  of  SU2C-PCF  sponsored  combination  therapies;  4)  Identify  molecular  determinants  of 
abiraterone  sensitivity  and  acquired  resistance  in  patients;  5)  Conduct  clinical  trials  of  novel  combinations 
targeting  AR  and/or  the  PTEN  pathway,  based  on  existing  preclinical  data  and  an  understanding  of 
resistance  mechanisms;  6)  Identify  molecular  determinants  of  sensitivity  and  acquired  resistance  to  PARP 
inhibitors  in  patients. 

Contact Information at funding agency: Frederic Biemar, frederic.biemar@aacr.org, (215) 446-7261


W81XWH-11-1-0337 (PI: Chinnaiyan) 09/30/11 - 09/29/15 1.2 cal mos

Department of Defense $145,145/yr

Prostate  Cancer  Research  Program  Idea  Development  Award,  Established  Investigator 

Discovery  of  Novel  Gene  Elements  Associated  with  Prostate  Cancer  Progression 

Goal(s): determine if prostate cancer harbors numerous uncharacterized ncRNAs and show that a subset of
these are differentially expressed transcripts

Specific  Aims:  1)  to  employ  next  generation  sequencing  to  comprehensively  annotate  expressed  regions  in 
the  prostate  cancer  transcriptome;  2)  to  validate  and  characterize  transcriptional  units  in  poorly  annotated 
regions;  3)  to  elucidate  a  functional  and  clinical  role  of  poorly-annotated  transcripts  in  prostate  cancer. 
Contact Information at funding agency: Janet Kuhns, 301-619-2827, janet.kuhns@us.army.mil, U.S. Army
Medical Research Acquisition Activity, 820 Chandler Street (MCMR-AAA-R), Fort Detrick, MD 21702-5014

R01CA154365 (PIs: Beer, Chinnaiyan) 12/01/10 - 11/30/15 0.36 cal mos

NIH/NCI  $85,529/yr 

Identification  and  Characterization  of  Gene  Fusions  in  Lung  Adenocarcinoma 

Goal(s): to identify recurrent gene fusions in human lung adenocarcinoma and to determine the
functional consequences of their action in lung adenocarcinoma cell lines. Priority will be given to
those lung gene fusions that may be therapeutically targetable.

Contact Information at funding agency: Rebecca Brightful, Email: brightfr@mail.nih.gov Phone:
301-631-3011 Fax: 301-451-5391


R01CA132874-01 (PI: Chinnaiyan) 03/01/09 - 12/31/13 0.96 cal mos

NIH/NCI  $166,000/yr 

Molecular  Sub-typing  of  Prostate  Cancer  Based  on  Recurrent  Gene  Fusions 

Goal(s): Identification of novel molecular subtypes of cancer, characterization of these subtypes, and
correlation of these with disease outcome using prostate needle biopsy samples.

Specific Aims: 1) discovery and nomination of novel molecular sub-types of prostate cancer, 2) characterize
associations  of  molecular  sub-types  of  prostate  cancer  with  clinical  outcome  and/or  aggressiveness  of 
disease  in  a  radical  prostatectomy  cohort,  3)  characterize  associations  of  molecular  sub-types  of  prostate 
cancer  with  clinical  outcome  and/or  aggressiveness  of  disease  using  prostate  needle  biopsy  samples. 

Contact  Information  at  funding  agency:  Grants  Management  Specialist:  Rebecca  Brightful,  Email: 
brightfr@mail.nih.gov  Phone:  301-631-3011 


P50  CA69568  (PI:  Pienta)  06/01/08  -05/31/13  0.78  cal  mos 

NIH/NCI  $177,509/yr 

SPORE  in  Prostate  Cancer 

Project  1  Title:  Role  of  gene  fusions  in  prostate  cancer 

Goal(s):  determine  the  role  of  ETS  family  gene  fusions  in  prostate  cancer  cell  lines;  characterize  the 
phenotype  of  androgen-regulated  ETS  transgenic  mice. 

Specific  Aims:  1)  Characterization  of  Oncogenic  ETS  Gene  Fusions  in  Prostate  Cancer;  2)  Determine  the 
role of ETS family gene fusions in prostate cancer cell lines; 3) characterize the phenotype of
androgen-regulated ETS transgenic mice.

Role:  Co-Investigator 

Contact  Information  at  funding  agency:  Andrew  Hruszkewycz,  301-496-8528,  hruszkea@mail.nih.gov 

P50  CA69568  (PI:  Pienta)  06/01/08-05/31/13  0.48  cal  mos 

SPORE  in  Prostate  Cancer  $306,062/yr 

Core 3: Tissue/Informatics Core Director NIH/NCI Goal(s): the goal of the Core is to collect biological
material with associated clinical information to facilitate translational research.

Role:  Core  Director 

Contact  Information  at  funding  agency:  Andrew  Hruszkewycz,  301-496-8528,  hruszkea@mail.nih.gov 

PENDING: 

1UM1HG006508-01A1  (PI:  Chinnaiyan)  03/01/13  -  02/28/17  1.2  cal  mos 

National Institutes of Health $1,500,000/yr

Exploring  Precision  Cancer  Medicine  for  Sarcoma  and  Rare  Cancers 

Specific  Aims:  Project  1)  Clinical  Genomic  Study,  1)  Accrue  500  patients  with  advanced  or  refractory  rare 
cancer for participation in an integrated approach to Clinical Genomics; 2) Interpret results through a
multidisciplinary Sequencing Tumor Board and disclose results to patients and their physicians; 3) Measure the
influence  of  sequence  results  provided  to  patients;  4)  Determine  the  frequency  of  clinically  significant 
germline  mutations  in  patients  undergoing  comprehensive  tumor  sequence  analysis. 

Project  2)  Sequencing,  Analysis,  and  Interpretation  of  Sequencing  Data;  1)  Process  and  track  specimens  and 
ensure  quality  control;  2)  Sequence  tumor  and  germline  biospecimens;  3)  Analyze  sequencing  data  to 
identify  clinically  significant  variants;  4)  Interpret  and  translate  sequence  variants  into  clinical  oncology 
setting;  5)  Assess  and  evaluate  costs  associated  with  clinical  sequencing. 

OVERLAP  FOR  ALL  CURRENT  AND  PENDING  GRANTS: 

None. 

SUMMARY

Pseudogene transcripts can provide a novel tier of gene regulation through generation of
endogenous siRNAs or miRNA-binding sites. Characterization of pseudogene expression,
however, has remained confined to anecdotal observations due to analytical challenges posed
by the extremely close sequence similarity with their counterpart coding genes. Here, we
describe a systematic analysis of pseudogene “transcription” from an RNA-Seq resource of
293 samples, representing 13 cancer and normal tissue types, and observe a surprisingly
prevalent, genome-wide expression of pseudogenes that could be categorized as ubiquitously
expressed or lineage and/or cancer specific. Further, we explore disease subtype specificity
and functions of selected expressed pseudogenes. Taken together, we provide evidence that
transcribed pseudogenes are a significant contributor to the transcriptional landscape of cells
and are positioned to play significant roles in cellular differentiation and cancer progression,
especially in light of the recently described ceRNA networks. Our work provides a
transcriptome resource that enables high-throughput analyses of pseudogene expression.

INTRODUCTION 

Pseudogenes are ancestral copies of protein-coding genes that arise from genomic duplication
or retrotransposition of mRNA sequences into the genome followed by accumulation of
deleterious mutations due to loss of selection pressure, degenerating eventually into so-called
genetic fossils (Sasidharan and Gerstein, 2008). Pseudogenes pervade the genome,
representing virtually every coding gene, and due to their extremely close sequence similarity
with their cognate genes, complicate whole-genome sequencing and gene expression
analyses. A growing body of evidence strongly suggests their potential roles in regulating
cognate wild-type gene expression/function by serving as a source of endogenous siRNA
(Tam et al., 2008; Watanabe et al., 2008), antisense transcripts (Zhou et al., 1992),
competitive inhibitors of translation of wild-type transcripts (Kandouz et al., 2004), and
perhaps dominant-negative peptides (Katoh and Katoh, 2003). Pseudogene transcription has
also been shown to regulate cognate wild-type gene expression by sequestering miRNAs
(Poliseno et al., 2010). The recently described competing endogenous RNA (ceRNA)
networks comprising sets of coordinately expressed genes with shared miRNA response
elements (MREs) provide an additional dimension of (post-)transcriptional regulation in
which the role of pseudogenes might overlap with those of protein-coding genes (Salmena
et al., 2011; Sumazin et al., 2011).

Previous genome-wide studies of pseudogenes focused on the identification of their
chromosomal coordinates and annotations based on diverse computational approaches (Karro
et al., 2007; Zhang and Gerstein, 2004), including PseudoPipe (Zhang et al., 2006), HAVANA
(Solovyev et al., 2006), PseudoFinder (Lu and Haussler, 2006, ASHG, conference), and
Retrofinder (Zheng and Gerstein, 2006). These individual pipelines were subsequently
consolidated into an integrated consensus platform, ENCyclopedia Of DNA Elements (ENCODE),
which now serves as the definitive database of manually curated and annotated pseudogenes
as well as pseudogene transcripts (Zheng et al., 2007). By contrast, genome-wide analyses
of pseudogene expression have been somewhat arbitrary, mainly relying upon
evidence of pseudogene transcripts obtained from disparate gene expression platforms,
including public mRNA and EST databases, cap analysis gene expression (CAGE) studies, and
gene identification signature-paired end tags (GIS-PET) (Ruan et al., 2007). Given the
essentially anecdotal observations of pseudogene expression, only 160 expressed human
pseudogenes are currently documented in ENCODE. Though this could be due to a general lack
of transcription of pseudogenes, as generally presumed, it may also be reflective of an
insufficient and uneven depth of coverage afforded by early gene expression analysis tools.

Figure 1. Pseudogene Expression Analysis Pipeline

The bioinformatics pipeline for analyzing pseudogene transcription involved the following
steps: (1) Paired-end transcriptome sequencing reads were mapped to the human genome and
UCSC Genes using ELAND. (2) Passed purity (PF) filter reads were assigned into three
sequence bins as indicated. (3) Paired reads with one or both partners mapping to
unannotated genomic regions were clustered based on overlapping alignments. (4) Clusters
were filtered to remove singleton, stacked, and duplicate reads. (5) To determine a
consensus pseudogene annotation, clusters were scanned through the Yale and ENCODE
pseudogene databases as well as analyzed with a BLAT-based custom homology search. Data
from individual samples were then compared to generate pseudogene expression signatures.
Clusters not assigned at this stage were categorized as other potentially nonpseudogene
transcripts.

See also Figures S1, S2, and S3 and Tables S1 and S2.


In this context, the recent maturation of next-generation high-throughput sequencing
platforms provides unprecedented access to genome-wide expression analyses previously not
achievable (Han et al., 2011a; Morozova et al., 2009). Here, we analyzed a compendium of
RNA-Seq transcriptome data specifically focusing on pseudogene transcripts from a total of
293 samples encompassing 13 different tissue types, including 248 cancer and 45 benign
samples. In order to carry out a systematic analysis of pseudogene expression, we developed
a bioinformatics pipeline focused on detecting pseudogene transcription. This integrative
approach provided evidence of expression for 2,082 distinct pseudogenes, which displayed
lineage-specific, cancer-specific, as well as ubiquitous expression patterns. Taken
together, this Resource nominates a multitude of expressed pseudogenes that merit further
investigation to determine their roles in biology and in human disease.

RESULTS 

Development  of  a  Bioinformatics  Platform 
for  the  Analysis  of  Pseudogene  Transcription 

Paired-end RNA-Seq data from a compendium of 293 samples, representing both cancer and
benign samples from 13 different tissue types recently generated in our laboratory, was
utilized to build a pseudogene analysis pipeline (Figure 1 and Figure S1 and Table S1
available online). Sequencing reads were mapped to the human genome (hg18) and University
of California Santa Cruz (UCSC) Genes using Efficient Alignment of Nucleotide Databases
(ELAND) software of the Illumina Genome Analyzer Pipeline (Table S2). Reads showing
mismatches to the reference genes but mapping perfectly to unannotated regions elsewhere in
the genome were used as the primary data for pseudogene expression analysis. Two or more
unique, high-quality overlapping reads nucleating at the loci of differences between
wild-type genes and pseudogenes were used to define de novo “clusters” (ranging from 40 to
5,000 bp). These clusters were employed for gene expression analyses in a way analogous to
the “probes” used in microarray gene expression studies, though unlike predesigned and
fixed probes used in microarrays, the sequence clusters used here were formed de novo,
solely based on the presence (and levels) of transcripts. Thus, one or more clusters (like
one or more probes in microarrays) represented a transcript, whereas the number of reads
mapping to a cluster (analogous to fluorescence intensity due to probe hybridization on
microarrays) provided a measure of expression of the corresponding (pseudo)genes. For
example, Figure 2 shows a schematic representation of the cluster alignments for two
representative pseudogenes, ATP8A2-Ψ (Figure 2A) and CXADR-Ψ (Figure 2B). As can be seen,
mutation-dense regions in the reference sequence provide foci of pseudogene-specific
cluster formation. Naturally, pseudogenes with sparse and dispersed mutations nucleate
fewer clusters and require higher depth of coverage for reliable detection.
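
The cluster definition above (two or more unique, overlapping reads, with singleton and duplicate reads discarded) can be illustrated with a minimal sketch. This is not the pipeline used in the study; the function name, the half-open coordinate convention, and the example reads are assumptions made purely for illustration.

```python
from typing import List, Tuple

Read = Tuple[str, int, int]            # (chromosome, start, end), end exclusive (assumed convention)
Cluster = Tuple[str, int, int, int]    # (chromosome, start, end, number of unique reads)

def cluster_reads(reads: List[Read], min_reads: int = 2) -> List[Cluster]:
    """Merge overlapping read alignments into clusters.

    Exact duplicate alignments are collapsed first, and clusters supported by
    fewer than `min_reads` unique reads (singletons by default) are dropped.
    """
    unique = sorted(set(reads))        # collapse duplicates, sort by chromosome and start
    clusters: List[Cluster] = []
    for chrom, start, end in unique:
        if clusters and clusters[-1][0] == chrom and start < clusters[-1][2]:
            c = clusters[-1]           # read overlaps the open cluster: extend it
            clusters[-1] = (chrom, c[1], max(c[2], end), c[3] + 1)
        else:
            clusters.append((chrom, start, end, 1))
    return [c for c in clusters if c[3] >= min_reads]

# Hypothetical 40-mer alignments, not data from the study.
example = [
    ("chr1", 100, 140), ("chr1", 120, 160), ("chr1", 120, 160),  # overlap plus one duplicate
    ("chr1", 500, 540),                                           # singleton, filtered out
    ("chr2", 100, 140), ("chr2", 130, 170),
]
print(cluster_reads(example))
```

The read count kept with each cluster plays the role that the text assigns to probe intensity on a microarray.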

Overall, 2,156 unique pseudogene transcript clusters were identified, and their genomic
coordinates (start and end points) were compared with the coordinates of pseudogenes
annotated in the ENCODE (Zheng et al., 2007) and Yale pseudogene resources
(http://www.pseudogene.org) (Karro et al., 2007), the two most comprehensive pseudogene
annotation databases. Genomic coordinates of 934 unique pseudogene transcript clusters in
our data set were found to overlap with the pseudogene coordinates annotated in both Yale
and ENCODE databases. In addition, 585 clusters overlapped with Yale and 92 with ENCODE
databases, displaying a high degree of overall concordance between our data and the
authentic resources and highlighting a level of difference between the two reference
databases (that necessitated our consideration of both resources). Further, as multiple
clusters can sometimes represent one distinct pseudogene transcript, the 2,156 transcript
clusters provided evidence for 2,082 distinct transcripts. Of these, 1,506 transcripts
overlap with the genomic coordinates of pseudogenes in Yale and/or ENCODE, and up to 576
transcripts are potentially novel (described below) (Figure S2A). The 2,082 pseudogene
transcripts, in turn, correspond to 1,437 wild-type genes, clearly indicating that the
transcripts of multiple pseudogenes arising from the same wild-type genes are also detected
in our compendium. Taken together, our study provides evidence of widespread transcription
of pseudogenes unraveled by high-throughput transcriptome sequencing (Table S3).
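
As an illustration of the coordinate comparison described above, the following sketch labels a cluster by whether it overlaps an annotation in either reference set. The interval representation, function names, and coordinates are hypothetical; the actual comparison against the Yale and ENCODE resources may have used different rules.

```python
from typing import Iterable, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), end exclusive (assumed convention)

def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals lie on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def annotation_status(cluster: Interval,
                      yale: Iterable[Interval],
                      encode: Iterable[Interval]) -> str:
    """Label a cluster by the reference annotation sets it overlaps."""
    in_yale = any(overlaps(cluster, p) for p in yale)
    in_encode = any(overlaps(cluster, p) for p in encode)
    if in_yale and in_encode:
        return "Yale and ENCODE"
    if in_yale:
        return "Yale only"
    if in_encode:
        return "ENCODE only"
    return "unannotated"

# Hypothetical annotation intervals, for illustration only.
yale_db = [("chr1", 1000, 2000), ("chr2", 500, 900)]
encode_db = [("chr1", 1500, 2500), ("chr3", 100, 400)]

print(annotation_status(("chr1", 1600, 1700), yale_db, encode_db))
print(annotation_status(("chr2", 850, 950), yale_db, encode_db))
print(annotation_status(("chr5", 10, 50), yale_db, encode_db))
```

Clusters labeled "unannotated" in such a scheme correspond to the potentially novel transcripts discussed in the text.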

Pseudogene clusters across the sample-wise compendium reveal that pseudogenes of
housekeeping genes such as ribosomal proteins are widely expressed across tissue types.
Additionally, pseudogene transcripts corresponding to CALM2 (calmodulin 2 phosphorylase
kinase, delta), TOMM40 (translocase of outer mitochondrial membrane 40), NONO (non-POU
domain-containing, octamer-binding), DUSP8 (dual-specificity phosphatase 8), PERP (TP53
apoptosis effector), and YES (v-yes-1 Yamaguchi sarcoma viral oncogene homolog 1), etc.
were observed in more than 50 samples each, which were further validated by
pseudogene-specific RT-PCR followed by Sanger sequencing (Table S4).

Further, because our RNA-Seq compendium comprises 35- to 45-mer short sequence reads that
largely generated short sequence clusters not optimal for available pseudogene analysis
tools such as Pseudopipe (Zhang et al., 2006) and Pseudofam (Lam et al., 2009) used in
generating ENCODE and Yale databases, we carried out a direct query of individual clusters
against the human genome (hg18) using the BLAT tool from UCSC, which is ideally suited for
short sequence alignment searches (Kent, 2002). Based on this “custom” analysis, or simply
BLAT (Figure S2A), we were able to independently assign 1,888 clusters representing 1,820
unique pseudogenes to unique genomic locations.

Detection  of  Potentially  Novel  Pseudogene  Transcripts 

Comparing the genomic locations of the pseudogene clusters identified by BLAT analysis to
those identified by Yale and ENCODE databases (Figure S2A), 762 clusters were found to be
common to all three resources, but a remarkably large set of 585 clusters was uniquely
defined by BLAT analysis alone. Some of the pseudogene transcripts thus identified included
BAT1, BTBD1, COX7A2L, CTNND1, EIF5, PAPOLA, PARP11, SYT, ZBTB12, and others (n = 25) and
were validated by Sanger sequencing (Table S4). Thus, analysis of RNA-Seq data provided a
reliable assessment of expressed pseudogenes.

Though designating the BLAT-based pseudogene clusters as novel pseudogenes must await
further sequence characterization (such as analysis of ORF structure and potential genesis
of novel protein-coding gene family members, etc.), a small subset of clusters was seen to
be localized in the vicinity of known pseudogenes. Thus, we found 92 clusters that resided
adjacent (within 5 kb) to previously annotated pseudogenes (Figure S2B, left), and we
hypothesize that these may represent pseudogenes with inaccurate annotations in the current
databases. For example, the chromosomal coordinates of CENTG2-Ψ (OTTHUMT00000085288, Havana
processed pseudogene) are defined in ENCODE as Chr1:177822463-177824935. As expected, we
observed a cluster mapping to this locus; however, interestingly, we also observed a
distinct cluster (Chr1:177825028-177826295) less than 100 base pairs away. Although
unannotated in the current databases, the sequence of this adjacent locus shows a high
degree of homology to the CENTG2 parental gene (Figure S2B, right), strongly suggesting
that this cluster represents an extension of the existing genomic coordinates of CENTG2-Ψ
annotation. Similar observations were made with HNRNPA1 and the HNRNPA1-Ψ on Chr6q27
(Figure S2B, right). 493 BLAT-derived clusters that were not in close proximity to
annotated pseudogenes likely represent putative pseudogenes currently missing in the
database annotations (Table S3B).
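
The adjacency criterion can be checked directly on the CENTG2-Ψ coordinates quoted above. The sketch below is illustrative only: the helper names are hypothetical, and treating coordinates as plain integers with the gap taken as the difference between the cluster start and the annotation end is an assumption about the convention.

```python
from typing import Optional, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)

def gap_bp(a: Interval, b: Interval) -> Optional[int]:
    """Distance in bp between two intervals; 0 if they overlap, None if on different chromosomes."""
    if a[0] != b[0]:
        return None
    return max(0, a[1] - b[2], b[1] - a[2])

def is_adjacent(cluster: Interval, annotation: Interval, window: int = 5000) -> bool:
    """True if the cluster lies outside the annotation but within `window` bp of it."""
    gap = gap_bp(cluster, annotation)
    return gap is not None and 0 < gap <= window

# Coordinates taken from the text (hg18).
centg2_annotation = ("chr1", 177822463, 177824935)
adjacent_cluster = ("chr1", 177825028, 177826295)

print(gap_bp(adjacent_cluster, centg2_annotation))       # 93
print(is_adjacent(adjacent_cluster, centg2_annotation))  # True
```

The 93 bp gap agrees with the statement that the second cluster lies less than 100 base pairs from the annotated locus.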
Figure 2. Schematic Representation of Cluster Alignments with Pseudogene Transcripts

(A and B) The relative genomic structures of the parental genes are shown aligned to the respective pseudogenes, with their chromosomal locations indicated on
the sides, (A) ATP8A2-Ψ and (B) CXADR-Ψ. The sequencing alterations distinguishing the pseudogene from the parental gene are indicated in red. The
pseudogene transcripts are illustrated as black bars with red hatches, which indicate divergence from the parental sequence, and the length of the transcript in
base pairs is shown on the side. These representations are then overlaid with schematics of paired-end reads used to form pseudogene clusters (in blue),
followed by overlapping sequences in a zoomed-in region of the cluster. A comparative representation of the parental (WT) and pseudogene (Ψ) sequences for
the specified region is shown on top.

See  also  Figure  S4. 


Next, we assessed the technical and analytical factors influencing the yield of pseudogene
transcripts. As may be expected, a positive correlation was observed between the sequencing
depth and total number of pseudogene transcripts (correlation coefficient, +0.65)
(Figure S3A). However, no significant correlation was observed between the absolute measure
of percent similarity between pseudogene-WT pairs and pseudogene yield. Importantly, the
metric of overall percent similarity accounts for gap penalty and mismatches in BLAT
search, but it is the “distribution” of the mismatches that is critical in resolving
pseudogenes from nearly identical wild-type sequences; for example, a few mismatches,
accumulated in a small stretch, are more effective in confidently distinguishing pseudogene
expression from wild-types as compared to a higher number of mismatches that are scattered
over long stretches of sequence (Figure 2). Thus, three primary factors determine the
detection of pseudogene transcription by RNA-Seq: (1) the level of expression of the
pseudogenes (i.e., the higher the level of expression, the higher the likelihood of
detection), (2) the depth of RNA sequencing, and (3) overall distribution of mismatches
with respect to the wild-type.
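
To make the point about mismatch distribution concrete, the sketch below counts how many read start positions would yield a read spanning several mismatch sites at once. The read length of 40, the requirement of at least two mismatches per read, and the example positions are all assumptions chosen for illustration; they are not parameters reported in the study.

```python
from typing import Sequence

def informative_read_starts(mismatch_positions: Sequence[int],
                            seq_len: int,
                            read_len: int = 40,
                            min_mismatches: int = 2) -> int:
    """Count read start positions whose read covers at least `min_mismatches` mismatch sites."""
    count = 0
    for start in range(seq_len - read_len + 1):
        covered = sum(start <= p < start + read_len for p in mismatch_positions)
        if covered >= min_mismatches:
            count += 1
    return count

# Hypothetical 1,000 bp pseudogene: three mismatches packed into 11 bp
# versus five mismatches spread 200 bp apart.
clustered = informative_read_starts([500, 505, 510], seq_len=1000)
scattered = informative_read_starts([100, 300, 500, 700, 900], seq_len=1000)
print(clustered, scattered)  # 40 0
```

Under these assumptions the sequence with fewer but tightly grouped mismatches offers 40 distinguishing read positions, whereas the one with more, widely spaced mismatches offers none, which mirrors the argument made in the text.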

To explore the loci of transcription regulatory elements associated with pseudogene
transcription, we carried out ChIP-Seq analysis of a breast cancer cell line MCF7 probed
with H3K4me3, a histone mark associated with transcriptionally active chromosomal loci, and
integrated the results with the MCF7 pseudogene transcript data. Interestingly, we observed
a statistically significant enrichment of H3K4me3 peaks at expressed pseudogene loci as
compared to nonexpressed pseudogenes (p = 0.0054) (Figure S3B), suggesting that the
pseudogene transcripts observed by RNA-Seq are associated with transcriptionally active
genomic loci. Interestingly, the pseudogene transcripts associated with H3K4me3 peaks
encompass both unprocessed and processed pseudogenes, with no discernible differences in
the pattern of expression. Considering the role of 3' UTRs with MREs in ceRNA regulatory
networks, we also looked at the frequency of 3' UTR sequences retained in our set of
pseudogene transcripts and observed that at least 71% of all pseudogene transcripts retain
distinct 3' UTR sequences similar to their cognate wild-type genes (Figure S3C).
Interestingly, comparing the pseudogene transcripts with a list of genes implicated in
ceRNA networks (Han et al., 2011b; Tay et al., 2011), we observed more than 400 overlapping
transcripts (Table S5). The presence of noncoding pseudogene transcripts with similar
3' UTRs (and MREs) adds a further level of complexity to ceRNA regulatory networks.

Next, we assessed a potential correlation between the expression of pseudogenes present
within the introns of unrelated, expressed genes with their “host” genes. Interestingly, no
significant association was observed, suggesting that pseudogenes are likely subject to
independent regulatory mechanisms even when residing within other transcriptionally active
genes. Further, our observations with the breast-specific unprocessed pseudogene ATP8A2-Ψ
(likely arising from duplication of wild-type ATP8A2, thus likely harboring similar promoter
elements) also indicate that there is no apparent correlation between the pseudogene
expression and that of the wild-type gene, which is expressed ubiquitously (described
later). Thus, in summary, although it is tempting to speculate that pseudogene expression
may be regulated by the promoter elements from the cognate gene or the host genes, our data
suggest that more complex/indirect factors may be at play. Next, we assessed a possible
correlation between the expression of pseudogenes with that of cognate wild-type genes, and
intriguingly, no significant pattern of correlation was observed (Figure S3D).

Focusing on the pseudogenes whose genomic coordinates are annotated in the reference
databases, we next analyzed the expression profiles of the 1,056 unique transcripts.

Patterns  of  Pseudogene  Expression  in  Human  Tissues 

Analyzing the expression data from 248 cancer and 45 benign samples from 13 different
tissue types (total 293 samples), we observed broad patterns of pseudogene expression,
including 1,056 pseudogenes that were detected in multiple samples (Table S6), which
supports the hypothesis that transcribed pseudogenes contribute to the typical
transcriptional repertoire of cells. In addition, we identified distinct patterns of
pseudogene expression, akin to that of protein-coding genes, including 154 highly
tissue/lineage-specific and 848 moderately tissue/lineage-specific (or enriched)
pseudogenes (Figure 3A). Moreover, we found 165 pseudogenes exhibiting expression in more
than 10 of the 13 tissue types examined, and these we classified as ubiquitous pseudogenes
whose transcription is characteristic of most cell types (Figure 3A, bottom).
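
The ubiquitous category is the only one given an explicit rule in the text (expression in more than 10 of the 13 tissue types). A minimal sketch of that rule follows, together with the per-tissue frequency (% of samples showing expression) that the Figure 3 heatmap displays. The function names, the single-tissue rule, and the toy sample table are assumptions; the thresholds actually used for the highly and moderately specific categories are not stated here.

```python
from collections import defaultdict
from typing import Dict, Set

def tissue_frequencies(sample_tissue: Dict[str, str], expressed: Set[str]) -> Dict[str, float]:
    """Percent of samples in each tissue type in which the pseudogene was detected."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for sample, tissue in sample_tissue.items():
        totals[tissue] += 1
        if sample in expressed:
            hits[tissue] += 1
    return {t: 100.0 * hits[t] / totals[t] for t in totals}

def breadth_label(freqs: Dict[str, float], ubiquitous_over: int = 10) -> str:
    """Label a pseudogene by the number of tissue types with any detected expression."""
    n_expressed = sum(1 for f in freqs.values() if f > 0)
    if n_expressed > ubiquitous_over:   # rule stated in the text: more than 10 of 13
        return "ubiquitous"
    if n_expressed == 1:                # assumed rule, for illustration only
        return "single tissue"
    return "intermediate"

# Toy compendium: 13 hypothetical tissue types with two samples each.
samples = {f"t{i}_s{j}": f"tissue{i}" for i in range(13) for j in range(2)}
seen_in = {f"t{i}_s0" for i in range(11)}   # detected in one sample of 11 tissue types
freqs = tissue_frequencies(samples, seen_in)
print(breadth_label(freqs))
```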

Of the 165 ubiquitous pseudogenes, a majority belonged to housekeeping genes, such as
glyceraldehyde 3-phosphate dehydrogenase (GAPDH), ribosomal proteins, several cytokeratins,
and other genes widely expressed in most cell types. This is expected, as these genes are
known to have numerous pseudogenes, and it is likely that several of these pseudogenes
retain the capacity for widespread transcription, mimicking their protein-coding
counterparts.

A second set of pseudogenes exhibited near ubiquitous expression but were frequently
transcribed at lower levels in most tissues and robustly transcribed in one or two tissues.
These pseudogenes were termed “nonspecific,” and this group harbors more than 870
pseudogenes, comprising a large portion of our data set (Figure 3A, middle). Many of the
pseudogenes previously shown to be expressed were found in this category, including some
pseudogenes reported as tissue specific, such as CYP4Z2P, a pseudogene previously reported
to be expressed only in breast cancer tissues (Rieger et al., 2004). Other candidates
observed in this category include pseudogenes derived from Oct-4 (Kastler et al., 2010),
Connexin-43 (Bier et al., 2009; Kandouz et al., 2004), and BRAF (Zou et al., 2009), among
others (Table S6).

Though powerful, our approach is nevertheless limited to pseudogene transcripts that are
expressed above the current threshold of detection by RNA-Seq and possess distinct
stretches of sequence mismatches compared with their protein-coding parental genes. Thus,
for example, PTENP1, a pseudogene of PTEN recently implicated in the biology of the
phosphatidylinositol 3-kinase (PI3K) signaling pathway, was not detected in our compendium
possibly due to the preponderance of cancer samples in our cohort, which tend to show low
expression or deletion of this pseudogene (Poliseno et al., 2010).

Lineage-  and  Cancer-Specific  Pseudogene  Expression 
Signatures 

Lineage-specific pseudogene transcripts may have the potential for lineage-specific
functions and may represent novel elements that facilitate biological characteristics that
are unique to distinct tissue types. In this regard, we observed 154 pseudogenes with
highly specific expression patterns, including pseudogenes derived from AURKA (kidney
samples), RHOB (colon samples), and HMGB1 (myeloproliferative neoplasms [MPNs])
(Figure 3A, top). Interestingly, however, lineage-specific pseudogenes tended to represent
a small fraction of all pseudogenes expressed in a given tissue type, and the total number
of lineage-specific pseudogenes observed in a tissue type did not show a correlation with
the total number of samples analyzed. For example, B-lymphocyte cells (n = 19) and MPNs
(n = 9) showed more lineage-specific pseudogenes than breast (n = 64) or prostate (n = 89).
Conversely, we did observe more pseudogene
transcripts in samples with longer read lengths and deeper coverage, as expected. Together,
these data both confirm and formalize previous anecdotal observations of lineage-specific
pseudogene expression patterns by exploiting the power of RNA-Seq to resolve individual
transcripts (Figure 3B) (Bier et al., 2009; Lu et al., 2006; Rieger et al., 2004; Zou
et al., 2009).

Figure 3. Tissue/Lineage-Specific Pseudogene Expression Profiles

(A) Heatmap of pseudogene expression sorted on the basis of tissue-specific expression
displays tissue-specific (top), tissue-enriched/nonspecific (middle), and ubiquitously
expressed pseudogenes (bottom).

(B) Zoomed-in version of the top panel displaying tissue-specific expressed pseudogenes.
The columns represent different tissues, with the number of samples in parentheses. The
rows represent individual clusters mapping to specific pseudogenes. The color intensity
represents the frequency (%) of samples in a tissue type showing expression of a given
pseudogene (according to the scale indicated at the bottom). The key clusters are labeled
with their corresponding parental gene symbols. MPN, myeloproliferative neoplasms.

See also Table S6.

Because our sample compendium has a substantial number of cancer samples, we next focused
on pseudogenes with cancer-specific expression. Though a majority of the pseudogenes
examined were found in both cancer and benign samples, we observed 218 pseudogenes
expressed only in cancer samples, of which 178 were observed in multiple cancers and 40
were found to have highly specific expression in a single cancer type only (Figure 4A and
Table S7). Consistent with our previous results (Figure 3), we found that the number of
cancer-type-specific pseudogenes did not correlate with the number of samples sequenced in
a given cancer type. These results suggest that cancer samples harbor transcriptional
patterns of pseudogenes that are both lineage and cancer specific.
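
The split into pseudogenes seen only in cancer (in one or in several cancer types) versus those shared with benign samples amounts to simple set logic, sketched below. The function name and the example detections are hypothetical, and any expression threshold applied before calling a pseudogene "detected" is outside this sketch.

```python
from typing import Iterable, Tuple

def cancer_specificity(detections: Iterable[Tuple[str, bool]]) -> str:
    """Classify a pseudogene from the samples in which it was detected.

    Each detection is (tissue_type, is_cancer) for one sample.
    """
    detections = list(detections)
    if not detections:
        return "not detected"
    if any(not is_cancer for _, is_cancer in detections):
        return "cancer and benign"
    cancer_types = {tissue for tissue, _ in detections}
    return "single cancer type" if len(cancer_types) == 1 else "multiple cancer types"

# Hypothetical detections, for illustration only.
print(cancer_specificity([("prostate", True), ("prostate", True)]))
print(cancer_specificity([("prostate", True), ("breast", True)]))
print(cancer_specificity([("prostate", True), ("prostate", False)]))
```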

Among the cancer-specific pseudogenes, a few noteworthy examples included pseudogenes
derived from the eukaryotic translation initiation factors EIF4A1 and EIF4H, the
heterogeneous nuclear ribonucleoprotein HNRNPH2, and the small nuclear ribonucleoprotein
SNRPG (Figure 4B). Moreover, we observed pseudogenes corresponding to known
cancer-associated genes, including RAB-1, a Ras-related protein; VDAC1, a type-1
voltage-dependent anion-selective channel/porin; RCC2, a regulator of chromosome
condensation 2; and PTMA, prothymosin alpha. Interestingly, the parental protein-coding
PTMA gene has given rise to five processed pseudogenes that retain consensus TATA elements,
individual transcriptional start sites, and intact open reading frames that may potentially
code for proteins closely related to the parental PTMA protein. Importantly, we find
expression of PTMA-derived pseudogenes in more than 30 cancer samples, but not in any
benign cells, and these data suggest that PTMA-derived pseudogenes may not only contribute
transcripts to cancer cell biology but potentially proteins as well, warranting further
study of these pseudogenes in tumorigenesis.

Prostate  Cancer  Pseudogenes 

To investigate individual pseudogenes in greater detail, we focused on pseudogenes
associated with prostate and breast cancer, as our compendium has a substantial number of
these two cancer types represented. Analysis of lineage-specific pseudogenes restricted to
prostate cancers identified numerous pseudogenes, including several derived from parental
genes known to be altered or dysregulated in cancer; for example, NDUFA9, which encodes an
NADH oxidoreductase component of mitochondrial complex I that is reported to be upregulated
in testicular germ cell tumors (Dormeyer et al., 2008); EPCAM, an epithelial cell adhesion
molecule involved in cancer and stem cells signaling (Munz et al., 2009); and CES7, known
to be expressed only in the male reproductive tract (Gang et al., 2011) (Figure 3B and
Table S6). Among the prostate cancer-specific pseudogenes, CXADR-Ψ, a processed pseudogene
on chromosome 15, was of immediate interest, as the parental CXADR protein demonstrates
putative tumor suppressor functions and its loss is implicated in α-catenin silencing
(Pong et al., 2003). We therefore selected this pseudogene for further study in prostate
cancer and first evaluated custom Taqman assays that could distinguish CXADR-Ψ from
parental CXADR. The expression levels showed strong correlation with the RNA-Seq data
(Figure S3E). CXADR-Ψ expression was found to be upregulated in ~25% of prostate cancer
tissues, with minimal expression seen in benign prostate samples and nonprostate tissues
(Figure 5A). No correlation was observed between CXADR-Ψ and parental CXADR expression,
although parental CXADR also had some proclivity for prostate cancer-specific expression
(Figure 5B). Interestingly, CXADR-Ψ expression was nearly restricted to prostate cancers
lacking an ETS gene fusion, with few ETS-positive samples exhibiting expression of this
pseudogene. By contrast, parental CXADR gene expression was found in both ETS-positive and
ETS-negative samples (Figure 5C). Finally, we interrogated CXADR-Ψ and CXADR parental gene
expression in a set of six prostate patients with matched cancer and benign tissues
(including four ETS-negative and two ETS-positive pairs). Again, ETS-negative prostate
cancer samples displayed marked upregulation of CXADR-Ψ compared to the ETS-positive
patients, with parental CXADR expression being fairly constant between this set of patients
(Figure 5D). To establish the expression of CXADR-Ψ transcript, we were able to clone
CXADR-Ψ cDNA from two RNA-Seq-positive prostate cancer samples (Figure S5A), and as
predicted, these clones showed perfect sequence similarity to the pseudogene CXADR-Ψ and
only 84% to CXADR wild-type gene (Figure S5B).
In the course of these analyses, we also identified a prostate-cancer-specific readthrough
transcript involving KLK4, an androgen-induced gene, and KLKP1, an adjacent pseudogene.
This chimeric RNA transcript KLK4-KLKP1, combining the first two exons of KLK4 with the
last two exons of KLKP1, retains an open reading frame incorporating 54 amino acids encoded
by the KLKP1 pseudogene in the putative chimeric protein (Figure S6A). Curiously, this
readthrough was recently described in the prostate cancer cell line LNCaP as a cis
sense-antisense chimeric transcript (Lai et al., 2010). Intriguingly, the KLK4-KLKP1
transcript was highly expressed in 30%-50% of prostate cancer tissues, and this expression
was lineage and cancer specific, with minimal expression seen in benign prostate and other
tissues (Figure S6B). These data suggest that KLK4-KLKP1 may warrant further study as a
potential biomarker of prostate cancer as well as a candidate protein implicated in the
biological complexity of this disease.


Figure  4.  Cancer-Specific  Pseudogene  Expression  Profiles 

(A)  Heatmap  of  pseudogene  expression  sorted  according  to  cancer-specific 
expression  patterns  displays  pseudogene  transcripts  specific  to  individual 
cancers  (top),  common  across  multiple  cancers  (tissue-enriched;  middle),  and 
nonspecific  (bottom). 

(B)  Zoomed-in  version  of  the  top  panel  displaying  individual  cancer-specific 
expressed  pseudogenes.  The  columns  represent  different  tissues  with  the 
number  of  samples  in  parentheses.  The  rows  represent  individual  clusters 
mapping  to  specific  pseudogenes.  The  color  intensity  represents  the 
frequency (%) of samples in a tissue type showing expression of a given
pseudogene (according to the scale indicated at the bottom). The key
clusters are labeled with their corresponding parental gene symbols.

See  also  Figure  S6  and  Table  S7. 


Breast  Cancer  Pseudogenes 

Among the pseudogene candidates in breast cancer, we identified an
unprocessed pseudogene cognate to ATP8A2, a LIM domain-containing
protein speculated to be associated with stress response and
proliferative activity (Khoo et al., 1997) (Figure 3A, top, and
Table S3). The substantial sequence divergence of ATP8A2-Ψ on
chromosome 10 from the cognate ATP8A2-WT gene on chromosome 13 lends
high confidence to our computational identification, and we selected
this candidate for further validation. Taqman assays distinguishing
ATP8A2-WT transcripts from ATP8A2-Ψ showed a strong correlation
(r² = 0.98) with the expression pattern obtained by RNA-Seq
(Figure S3E), with ATP8A2-Ψ expression found to be restricted to breast
samples, the highest levels seen in a subset of breast cancer tissues
and cell lines (Figures 6A and 6B). By contrast, ATP8A2-WT expression
was highly variable across different tissue types and showed no
correlation with ATP8A2-Ψ expression (Figure 6B).

Figure 5. Expression of CXADR-Ψ in Prostate Cancer

(A and B) Histogram of expression values (y axis) of CXADR-Ψ (A) and CXADR-WT (B) across a panel of tissue samples (x axis). The order of samples on the x axis
is identical in both graphs to facilitate a visual comparison.

(C) A summary histogram of the expression values of CXADR-Ψ and CXADR-WT in prostate cancers either harboring or lacking an ETS transcription factor gene
fusion or in nonprostate samples.

(D) Expression of CXADR-Ψ and CXADR-WT in matched pairs of tumor and benign samples from prostate cancer patients. The patients’ ETS status is indicated
by the bar below.

T, prostate cancer; B, matched benign adjacent prostate. The expression values were normalized against GAPDH. Error bars represent means ± SE of the mean.
See also Figure S5.

Figure 6. Expression of ATP8A2-Ψ in Breast Cancer

(A and B) Histogram of expression values (y axis) of ATP8A2-Ψ (A) and ATP8A2-WT (B) across a panel of tissue samples (x axis). The order of samples on the x
axis is identical in both graphs to facilitate a visual comparison. (Inset) A summary histogram of the expression values of ATP8A2-Ψ and ATP8A2-WT in breast
cancer samples relative to benign breast and other tissues (left) and luminal versus basal breast cancer subtypes (right). The expression values were normalized
against GAPDH.

(C) Cell proliferation assays following siRNA knockdowns of ATP8A2-WT and -Ψ as indicated. NTC, nontargeting control; WT, siRNA against wild-type
ATP8A2; Ψ, siRNA against ATP8A2-Ψ.

We were further intrigued by the pattern of ATP8A2-Ψ expression within
breast tumors, where ~25% of tumors demonstrate extremely high levels
of this pseudogene, suggesting that ATP8A2-Ψ may contribute to a
particular subtype of breast cancer. We therefore analyzed ATP8A2-Ψ
expression with respect to luminal and basal breast subtypes, two
prominent categories of breast cancer with distinct molecular and
clinical characteristics. Unexpectedly, we found that ATP8A2-Ψ
expression was restricted to tumors with luminal histology, whereas
basal tumors showed minimal expression of this pseudogene (Figure 6A,
right). The wild-type ATP8A2 transcript did not display this pattern of
expression.

To investigate a potential role of ATP8A2-Ψ expression in breast
cancer, we first carried out siRNA-based knockdown of both the
wild-type and pseudogene RNA in two independent breast cancer cell
lines that expressed both transcripts (Figure S7A). Knockdown of
ATP8A2-Ψ with two independent siRNAs specifically inhibited the
proliferation of the overexpressing cell lines Cama-1 and HCC1806
(Figure 6C), but not of cell lines with no detectable levels of
ATP8A2-Ψ, for example, the benign breast epithelial cell line H16N2
(Figure 6C, right) and a pancreatic cancer cell line, BXPC3
(Figure S7D). Knockdown of ATP8A2-Ψ (but not ATP8A2-WT) also reduced
cell migration and invasion in in vitro Boyden chamber assays
(Figure 6D) as well as in vivo intravasation and metastasis in a
chicken chorioallantoic membrane xenograft assay (Figure 6F). In
contrast, knockdown of wild-type ATP8A2 had no effect on the
proliferation of any of the cell lines tested, suggesting an unexpected
growth regulatory role for ATP8A2-Ψ (Figure 6C). Surprisingly, though
the knockdown of wild-type ATP8A2 had a minimal effect on the
pseudogene transcript levels, ATP8A2-Ψ-specific siRNAs, apart from
reducing the ATP8A2-Ψ transcript, also reduced the wild-type protein
levels (Figures S7C and S7E). Thus, unlike Oct4 and BRAF pseudogene
transcripts, which have an inverse correlation with the wild-type
transcript levels, ATP8A2-Ψ and wild-type ATP8A2 transcripts
(Figures 6A and 6B) and protein (Figure S7E) do not seem to be
regulated in this manner. Subsequently, to assess the phenotypic effect
of ATP8A2-Ψ overexpression in benign cells, we cloned and overexpressed
the full-length ATP8A2 pseudogene cDNA in the benign breast epithelial
cell line TERT-HMEC. Two independent pooled populations of
ATP8A2-Ψ-overexpressing TERT-HMEC cells were found to undergo increased
proliferation and migration (Figure 6E), indicating the potential
oncogenic nature of this breast-specific pseudogene transcript.


DISCUSSION 

The recent advances in high-throughput transcriptome sequencing have
revealed widespread expression of noncoding RNAs in the context of
development and differentiation (Khachane and Harrison, 2010;
Nagalakshmi et al., 2008; Pickrell et al., 2010; Prensner et al., 2011;
Wilhelm et al., 2008). These studies, however, do not include
pseudogene expression analyses in their purview, likely due to the
challenge of extremely close sequence similarity with wild-type cognate
genes. Here, we interrogated the potential of RNA-Seq data to
unambiguously detect pseudogene transcripts and to assess whether
pseudogene expression is more common in the transcriptome than
previously realized. Surprisingly, we found evidence of widespread
expression of pseudogenes in our cancer transcriptome resource,
including 1,500 pseudogenes annotated in the Yale and ENCODE databases;
we also redefined the genomic coordinates of ~100 pseudogenes in
existing databases and nominated more than 400 potentially novel
pseudogenes. In aggregate, our analysis considerably expands the
spectrum of expressed pseudogenes documented previously (Harrison
et al., 2005; Yao et al., 2006; Zheng et al., 2007).

The extreme sequence similarity between pseudogenes and cognate
wild-type genes suggests a functional role for pseudogene transcripts;
indeed, pseudogene expression has been associated with both
downregulation of the cognate wild-type gene, such as eNOS in ovary, as
well as a positive effect on the expression of the wild-type gene, as
demonstrated recently, wherein PTENP1 expression upregulates PTEN
expression in prostate cells (Poliseno et al., 2010). Interestingly, a
class of pseudogenes called “unitary pseudogenes” does not have extant
cognate wild-type genes (Zhang et al., 2010). Nevertheless, as most
pseudogenes do have distinct cognate wild-type genes, we assessed the
correlation between expressed pseudogenes and their cognate wild-type
genes across multiple samples (of the same tissue type or across
diverse tissue types) and did not observe a statistically significant
correlation. This is not surprising, partly because our data set is
comprised of a heterogeneous set of samples representing diverse tissue
types. Further, the sensitivity of detection of individual pseudogene
transcripts is limited by the degree and distribution of dissimilarity
with the wild-type gene that determines the “effective” depth of
coverage; this limits the number of samples showing measurable
expression of individual pseudogene-wild-type pairs, making it
difficult to conduct robust statistical analyses. Future studies
involving larger sample sets with higher depth of coverage and longer
read length may be better able to resolve this question.

Taken together, our study provides a systematic approach to analyze
expressed pseudogenes using RNA-Seq data, enabling comparisons of
cancer versus benign tissues in multiple solid tumors. Our efforts lend
additional credence to the capacity of RNA-Seq to “re-define” the
functional elements of the genome and “re-annotate” the population of
pseudogenes implicated in human cell biology. Our approach overcomes
the limitations of previous analyses of pseudogene expression, which
were primarily anecdotal and heterogeneous in nature, and our
methodologies suggest avenues to reconcile the difficulty in
distinguishing pseudogene expression from parental protein-coding gene
expression, a facet that is important for all RNA-Seq studies aiming to
provide an accurate picture of gene expression. Finally, we describe
ATP8A2-Ψ and CXADR-Ψ pseudogenes preferentially associated with
distinct subsets of breast cancer and prostate cancer patients,
respectively.

(D) Boyden chamber assay showing cell migration (left) and invasion through matrigel (right).

(E and F) (E) The effect of ATP8A2-Ψ overexpression in TERT-HMEC cells on cell proliferation (left) and cell migration based on Incucyte wound confluency assay
(right) and (F) chicken chorioallantoic membrane assay of HCC-1806 cells treated with nontargeting control siRNA, ATP8A2-WT, or ATP8A2-Ψ siRNA showing
relative number of cells intravasated in the lower CAM (left) and metastatic cells in chicken lung (right).

Error bars represent means ± SE of the mean.

The recent description of intricate regulatory networks of
protein-coding transcripts called competitive endogenous RNAs (ceRNAs),
defined on the basis of coordinated regulation by common sets of
microRNA response elements (MREs), first intimated by Salmena et al.
(2011) and subsequently supported by experimental results from multiple
groups (Cesana et al., 2011; Han et al., 2011b; Karreth et al., 2011;
Tay et al., 2011), implicates potential noncoding functions for many
protein-coding transcripts. In this context, pseudogene transcripts
could provide an additional layer of complexity in conjunction with
their cognate wild-type genes or independently.

The cancer/tissue-specific pseudogene expression signatures described
here highlight the need to factor in pseudogene expression in all
high-throughput gene expression studies and also show that pseudogene
expression merits further exploration in its own right as an additional
layer of transcriptional complexity. To facilitate further analyses, we
provide here an
extensive  resource  of  RNA-Seq  data  of  human  cancer-related 
tissues  and  cell  lines. 

EXPERIMENTAL  PROCEDURES 
Data  Set 

Paired-end transcriptome sequence reads (2 × 40 and 2 × 80 base pairs)
were obtained from a total of more than 293 samples from 13 tissue types
(Figure S1 and Table S1). Each sample was sequenced on an Illumina Genome
Analyzer I or II according to protocols provided by Illumina as described
earlier (Palanisamy et al., 2010).

Pseudogene  Analysis  Pipeline 

Paired-end transcriptome reads were mapped to the human genome
(NCBI36/hg18) and University of California Santa Cruz (UCSC) Genes using
Efficient Alignment of Nucleotide Databases (ELAND) software of the
Illumina Genome Analyzer Pipeline, using a 32 bp seed length and allowing
up to two mismatches; detailed mapping status is presented in Table S2.
Reads that passed the purity filter, obtained from Illumina export and
extended output files (as described before), were parsed and binned into
three major categories: (1) both of the paired reads map to annotated
genes; (2) one or both of the paired reads map to unannotated regions in
the genome; and (3) neither of the reads maps (these include viral,
bacterial, and other contaminant reads, as well as sequencing errors).
The paired reads with one or both partners mapping to an unannotated
region were clustered based on overlaps of aligned sequences using the
chromosomal coordinates of the clusters. Singleton reads that did not
cluster and stacked/duplicated reads with the same start and stop genomic
coordinates (potential PCR artifacts) were filtered out. Passed filter
clusters were defined as units of transcript expression (analogous to a
“probe” on microarray platforms). These clusters were screened against
two human pseudogene resources, Yale human pseudogene (Build 53,
http://pseudogene.org/) (Karro et al., 2007) and Gencode (October 2009,
http://genome.ucsc.edu/ENCODE/) (Zheng et al., 2007), to identify and
annotate pseudogene clusters. The processed, duplicated, and fragmented
categories of pseudogene entries from Yale and the entries corresponding
to Level 1+2 (Manual Gene Annotations) and Level 3 (Automated Gene
Annotations) from Gencode were used. The clusters were also subjected to
homology search using the alignment tool BLAT
(http://www.soe.ucsc.edu/~kent) (Kent, 2002) for an independent
annotation. Sequence reads from individual samples were queried against
the resultant clusters defined by the union of Yale, ENCODE, and BLAT
output to assess the expression of pseudogenes (Figure 1 and Table S3).
The cutoff value for pseudogene expression in a sample was set at five or
more reads mapping to at least one cluster in a putative pseudogene
transcript. Pseudogene transcripts (one or more probes overlapping with
either Yale or ENCODE) detected in two or more samples in a tissue type
and absent in all other tissue types were defined as tissue/lineage
specific. Pseudogene probes detected in 10 out of 13 tissue types were
designated as ubiquitous. All other cases were described as an
intermediate category. Pseudogene transcripts detected in three or more
cancer samples and absent in all benign samples were designated as cancer
specific.
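
The clustering and classification rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the function names, the `(chrom, start, end)` read tuples, and the dictionary inputs are hypothetical, and collapsing exact duplicates to a single copy is one possible reading of "filtered out". Only the thresholds (five reads per sample, two samples for a tissue-specific call, three cancer samples for a cancer-specific call) are taken from the text.

```python
from collections import defaultdict

MIN_READS = 5            # reads needed to call a transcript expressed in a sample
MIN_TISSUE_SAMPLES = 2   # expressing samples needed within the single tissue
MIN_CANCER_SAMPLES = 3   # expressing cancer samples needed for a cancer-specific call


def cluster_reads(reads):
    """Merge overlapping reads from one sample into clusters.

    reads: iterable of (chrom, start, end) tuples.
    Exact duplicates (potential PCR artifacts) are collapsed to one copy
    (an assumption), and clusters supported by a single read are dropped.
    Returns a list of (chrom, start, end, n_reads) tuples.
    """
    clusters = []
    for chrom, start, end in sorted(set(reads)):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)   # extend the current cluster
            last[3] += 1
        else:
            clusters.append([chrom, start, end, 1])
    return [tuple(c) for c in clusters if c[3] > 1]


def classify(counts, tissue_of, is_cancer):
    """Label one pseudogene transcript from its per-sample read counts.

    counts:    {sample_id: read count for the transcript}
    tissue_of: {sample_id: tissue name}
    is_cancer: {sample_id: True for cancer, False for benign}
    Returns a set with "tissue_specific" and/or "cancer_specific".
    """
    expressed = [s for s, n in counts.items() if n >= MIN_READS]
    per_tissue = defaultdict(int)
    for s in expressed:
        per_tissue[tissue_of[s]] += 1

    labels = set()
    if len(per_tissue) == 1 and max(per_tissue.values()) >= MIN_TISSUE_SAMPLES:
        labels.add("tissue_specific")
    n_cancer = sum(1 for s in expressed if is_cancer[s])
    if n_cancer >= MIN_CANCER_SAMPLES and n_cancer == len(expressed):
        labels.add("cancer_specific")
    return labels
```

Keeping the thresholds as named constants makes it easy to test how sensitive the tissue-specific and cancer-specific calls are to the chosen cutoffs.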

We carried out multiple correlation analyses (Figure S3), including: (1)
passed filter reads (sequence yield) with total number of pseudogene
transcripts observed per sequencing run (pseudogene transcript coverage);
(2) expression of genes and pseudogenes carried out using 173
gene-pseudogene pairs in 64 samples that each show nonzero expression in
at least ten samples across the data set; (3) expression levels of ATP8A2
and CXADR pseudogenes obtained from RNA-Seq and qPCR; (4) ChIP-Seq
analysis of a breast cancer cell line MCF7 that was probed with H3K4me3
and compared with MCF7 pseudogene transcript data; and (5) pseudogene
transcripts with 3' UTR sequences (± 2 kb) that were compared with 3' UTR
sequences of their cognate genes using BLAT.
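
As an illustration of analysis (2), the sketch below filters a gene-pseudogene pair by the ten-sample requirement and then computes a correlation coefficient. The text does not name the correlation statistic, so Pearson's r is an assumption here, as is the reading that both members of a pair must be nonzero in the same samples; the function names and input dictionaries are hypothetical.

```python
import math

MIN_NONZERO = 10  # a pair must show nonzero expression in at least ten samples


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either vector is constant
    return sxy / math.sqrt(sxx * syy)


def pair_correlation(gene, pseudo):
    """Correlate a gene with its pseudogene over samples where both are nonzero.

    gene, pseudo: {sample_id: expression value}
    Returns None when fewer than MIN_NONZERO samples qualify.
    """
    shared = [s for s in gene if s in pseudo and gene[s] > 0 and pseudo[s] > 0]
    if len(shared) < MIN_NONZERO:
        return None
    return pearson([gene[s] for s in shared], [pseudo[s] for s in shared])
```

Squaring the returned value gives an r² of the kind reported for the RNA-Seq versus qPCR comparison.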

Pseudogene transcripts showing an overlap with transcripts involved in
ceRNA network genes reported previously (Sumazin et al., 2011; Tay
et al., 2011) were tabulated (Table S5). The entire sequence data set
will be submitted to dbGaP after securing requisite approvals.

RNA  Isolation  and  cDNA  Synthesis 

Total  RNA  was  isolated  using  Trizol  and  an  RNeasy  Kit  (Invitrogen)  with  DNase  I 
digestion  according  to  the  manufacturer’s  instructions.  RNA  integrity  was 
verified  on  an  Agilent  Bioanalyzer  2100  (Agilent  Technologies,  Palo  Alto,  CA). 
cDNA  was  synthesized  from  total  RNA  using  Superscript  III  (Invitrogen)  and 
random  primers  (Invitrogen). 

Quantitative  Real-Time  PCR 

Quantitative  real-time  PCR  (qPCR)  was  performed  using  Taqman  or  SYBR 
green-based  assays  (Applied  Biosystems,  Foster  City,  CA)  on  an  Applied 
Biosystems  7900HT  Real-Time  PCR  System,  according  to  standard  protocols. 
The Taqman assays for CXADR and ATP8A2 were custom designed
based  on  regions  of  differences  between  the  wild-type  and  pseudogene 
sequences  (Figure  S4).  Oligonucleotide  primers  for  SYBR  green  assays  were 
obtained  from  Integrated  DNA  Technologies  (Coralville,  IA).  The  housekeeping 
gene  GAPDH  was  used  as  a  loading  control.  Fold  changes  were  calculated 
relative  to  GAPDH  and  normalized  to  the  median  value  of  the  benign  samples. 
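
The normalization described above can be sketched as follows. The text does not give the formula, so the 2^-ΔCt calculation (which assumes roughly 100% amplification efficiency) is an assumption, and the function names and input layout are hypothetical.

```python
from statistics import median


def relative_expression(ct_target, ct_gapdh):
    """Expression of a target relative to GAPDH as 2^-(Ct_target - Ct_GAPDH)."""
    return 2.0 ** -(ct_target - ct_gapdh)


def fold_changes(samples, benign_ids):
    """Normalize GAPDH-relative expression to the median of the benign samples.

    samples:    {sample_id: (ct_target, ct_gapdh)}
    benign_ids: ids of the benign reference samples
    """
    rel = {s: relative_expression(*cts) for s, cts in samples.items()}
    baseline = median(rel[s] for s in benign_ids)
    return {s: value / baseline for s, value in rel.items()}
```

Using the median rather than the mean of the benign samples keeps the baseline from being skewed by a single outlying benign sample.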

CXADR-Ψ_F CGGTTTCAGTGCTCTATGTTGTTTG; CXADR-Ψ_R
TAAATTTAGGATTACATGTTTCTAGAACA; CXADR-Ψ_M 6FAM ATGCCATCCAAAACCA;
ATP8A2-Ψ_F CTGGTGTTCTTTGGCATCTACTCA; ATP8A2-Ψ_R CAGCTCAGGATCACAGTTGCT;
ATP8A2-Ψ_M 6FAM CTGGTCCACCATTCTC; ATP8A2-WT_F
ATCCTATTGAAGGAGGACTCTTTGGA; ATP8A2-WT_R CCAGCAAATTCCCAAGGTCAGT;
ATP8A2-WT_M 6FAM AAGGGCAGCCATTACT; KLK4-KLKP1_F ATGGAAAACGAATTGTTCTG;
and KLK4-KLKP1_R CAGTGTTCCGGGTGATGCAG.

Additionally, inventoried Taqman assays for CXADR-WT (assay ID
Hs00154661_m1) and ATP8A2-WT (assay ID Hs00185259_m1) were used.

RT-PCR  and  Sanger  Sequencing 

Sequence stretches unique to pseudogene transcripts were identified by
aligning the candidate pseudogene sequences with their corresponding
wild-type genes. PCR primers specific to pseudogene transcripts
(Table S4) were used to amplify pseudogene cDNAs from index samples
followed by Sanger sequencing of the PCR products. The resultant
sequences were analyzed using ClustalW to compare the identity between
pseudogene and cognate wild-type sequences.
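
Once two sequences have been aligned (for example, with ClustalW), a percent identity can be computed column by column. The sketch below assumes the alignment is supplied as two equal-length strings with `-` marking gaps; the function name is hypothetical, and skipping gap columns is only one of several conventions, not necessarily the one used in the study.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns (a convention choice)
        compared += 1
        matches += a == b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared
```

Comparing a cloned cDNA against both the pseudogene and the wild-type sequence with the same function gives the two identity values needed to decide which locus the clone came from.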

Cell  Proliferation  Assays 

Experimental cells were transfected with siRNAs using Oligofectamine
reagent (Life Sciences), and 3 days posttransfection, the cells were
plated for proliferation assays. At the indicated times, cell numbers
were measured using a Coulter Counter.

Wound  Healing  Assay  Using  Incucyte 

For the wound healing assay, vector control or ATP8A2
pseudogene-overexpressing cells were plated at high density, and 6 hr
later, uniform
scratch  wounds  were  made  using  Woundmaker  (Incucyte).  Relative  migration 
potential  of  the  cells  was  assessed  by  confluence  measurements  at  regular 
time  intervals  as  indicated  over  the  wound  area. 

ATP8A2  Pseudogene  Overexpression  Studies 

The  ATP8A2  pseudogene  cDNA  from  breast  cancer  cell  line  HCC1806  was 
cloned  into  pENTR-D-TOPO  entry  vector  (Invitrogen)  following  manufacturer’s 
instructions.  Sequence-confirmed  entry  clones  in  correct  orientation  were 
recombined  into  Gateway  pcDNA-DEST26  mammalian  expression  vector 
(Invitrogen)  by  LR  Clonase  II  enzyme  reaction  following  manufacturer’s 
instructions. TERT-HMEC cells were transfected using Fugene 6, and
polyclonal populations of cells expressing ATP8A2 pseudogene cDNA or empty
vector  constructs  were  selected  using  geneticin.  At  the  indicated  times,  cell 
numbers  were  measured  using  Coulter  Counter. 

Chicken  Chorioallantoic  Membrane  Assay 

Chicken  chorioallantoic  membrane  (CAM)  assay  for  tumor  growth  was  carried 
out  as  follows.  Fertilized  eggs  were  incubated  in  a  humidified  incubator  at  38°C 
for  10  days,  and  then  CAM  was  dropped  by  drilling  two  holes:  a  small  hole 
through  the  eggshell  into  the  air  sac  and  a  second  hole  near  the  allantoic 
vein  that  penetrates  the  eggshell  membrane  but  not  the  CAM.  Subsequently, 
a cutoff wheel (Dremel) was used to cut a 1 cm² window encompassing the
second hole near the allantoic vein to expose the underlying CAM. When
ready, CAM was gently abraded with a sterile cotton swab to provide access
to the mesenchyme, and 2 × 10⁶ cells in a 50 μl volume were implanted on
top. The windows were subsequently sealed and the eggs returned to the
incubator. After 7 days, extraembryonic tumors were isolated and weighed. Five
to  ten  eggs  per  group  were  used  in  each  experiment. 

SUPPLEMENTAL  INFORMATION 

Supplemental Information includes seven figures and seven tables and can be
found with this article online at doi:10.1016/j.cell.2012.04.041.

Abstract

Application  of  high-throughput  transcriptome  sequencing  has  spurred  highly  sensitive  detection  and  discovery  of 
gene  fusions  in  cancer,  but  distinguishing  potentially  oncogenic  fusions  from  random,  "passenger"  aberrations 
has  proven  challenging.  Here  we  examine  a  distinctive  group  of  gene  fusions  that  involve  genes  present  in  the 
loci  of  chromosomal  amplifications — a  class  of  oncogenic  aberrations  that  are  widely  prevalent  in  breast  cancers. 
Integrative analysis of a panel of 14 breast cancer cell lines, comparing gene fusions discovered by high-throughput
transcriptome sequencing and genome-wide copy number aberrations assessed by array comparative genomic
hybridization, led to the identification of 77 gene fusions, of which more than 60% were localized to amplicons
including 17q12, 17q23, 20q13, chr8q, and others. Many of these fusions appeared to be recurrent or involved
highly  expressed  oncogenic  drivers,  frequently  fused  with  multiple  different  partners,  but  sometimes  displaying 
loss  of  functional  domains.  As  illustrative  examples  of  the  "amplicon-associated"  gene  fusions,  we  examined 
here  a  recurrent  gene  fusion  involving  the  mediator  of  mammalian  target  of  rapamycin  signaling,  RPS6KB1  kinase 
in  BT-474,  and  the  therapeutically  important  receptor  tyrosine  kinase  EGFR  in  MDA-MB-468  breast  cancer  cell  line. 
These gene fusions comprise a minor allelic fraction relative to the highly expressed full-length transcripts and
encode chimeras lacking the kinase domains, on which the respective cells are not dependent. Our study
suggests that amplicon-associated gene fusions in breast cancer primarily represent a by-product of chromosomal
amplifications, constituting a subset of passenger aberrations, and should be factored in accordingly during
prioritization of gene fusion candidates.

Introduction

Chromosomal amplifications and translocations are among the most
common somatic aberrations in cancers [1,2]. Gene amplification is
an important mechanism for oncogene overexpression and activation.
Numerous recurrent loci of chromosomal amplifications have been
characterized in breast cancer, which result in gain of copy number
and overexpression of oncogenes such as ERBB2 on 17q12 (the
definitive molecular aberration in 20%-30% of all breast cancers) [3,4],
as well as many other oncogenic drivers including Myc [5], EGFR [6],
FGFR1 [7], CyclinD1 [8], RPS6KB1 [9], and others [10]. Chromosomal
translocations leading to generation of gene fusions represent
another prevalent mechanism for the expression of oncogenes in
epithelial cancers [11]. Recently, we described the discovery and
characterization of recurrent gene fusions in breast cancer involving
MAST family serine threonine kinases and Notch family of transcription
factors [12]. Interestingly, we also observed a large number of gene
fusions, including some recurrent fusions involving known oncogenes
localized at loci of chromosomal amplifications.

Here we carried out a systematic analysis of the association between
gene fusions and genomic amplification by integrating RNA-Seq data
with array comparative genomic hybridization (aCGH)-based whole-genome
copy number profiling from a panel of breast cancer cell lines.
We examined a set of “amplicon-associated gene fusions” that refer to
all the fusions where one or both gene partners are localized to a site of
chromosomal amplification. Specifically, we assessed the functional
relevance of two amplicon-associated fusion genes involving oncogenic
kinases EGFR and RPS6KB1 in the context of prioritizing fusion
candidates important in tumorigenesis. Our results suggest that recurrent
gene fusions localized to recurrent amplicons, displaying allelic
imbalance between the fusion partners, may represent an epiphenomenon of
genomic amplification cycles not essential for cancer development.

Materials  and  Methods 

Gene  Fusion  Data  Set 

Chimeric transcript candidates were primarily obtained from
paired-end transcriptome sequencing of breast cancer from a total
of more than 49 cell lines and 40 tissue samples described previously
[12]. aCGH data were generated using Agilent Human Genome
244A CGH Microarrays (Agilent Technologies, Santa Clara, CA)
according to the manufacturer’s instructions, and data were analyzed
using CGH Analytics (Agilent Technologies). Copy number alterations
were assessed using ADM-2, with a threshold setting of 6.0 and
a bin size of 10.

RNA  Isolation  and  Complementary  DNA  Synthesis 

Total  RNA  was  isolated  using  TRIzol  and  RNeasy  Kit  (Invitrogen, 
Carlsbad,  CA)  with  DNase  I  digestion  according  to  the  manufacturer’s 
instructions.  RNA  integrity  was  verified  on  an  Agilent  Bioanalyzer  2100 
(Agilent  Technologies).  Complementary  DNA  was  synthesized  from  total 
RNA  using  Superscript  III  (Invitrogen)  and  random  primers  (Invitrogen). 

Quantitative  Real-time  Polymerase  Chain  Reaction 

Primers for validation of candidate gene fusions were designed using
the National Center for Biotechnology Information Primer Blast
(http://www.ncbi.nlm.nih.gov/tools/primer-blast/), with primer pairs
spanning exon junctions amplifying 70- to 110-bp products for every
chimera tested. Quantitative polymerase chain reaction (QPCR) was
performed using SYBR Green MasterMix (Applied Biosystems, Carlsbad,
CA) on an Applied Biosystems StepOne Plus Real-Time PCR System.
All oligonucleotide primers were obtained from Integrated DNA
Technologies and are listed in Table W1. GAPDH was used as endogenous
control. All assays were performed twice, and results were plotted as
average fold change relative to GAPDH.

Cell  Proliferation  Assays 

Cells  were  transfected  with  small  interfering  RNAs  (siRNAs)  using 
Oligofectamine  reagent  (Life  Sciences,  Carlsbad,  CA),  and  3  days 
after  transfection,  the  cells  were  plated  for  proliferation  assays.  At  the 
indicated  times,  cell  numbers  were  counted  using  Coulter  Counter 
(Indianapolis,  IN). 

Western  Blot 

Cell pellets were sonicated in NP-40 lysis buffer (50 mM Tris-HCl,
1% NP-40, pH 7.4; Sigma, St. Louis, MO), complete protease
inhibitor mixture (Roche, Indianapolis, IN), and phosphatase
inhibitor (EMD Bioscience, San Diego, CA). Immunoblot analysis was
carried out using antibodies for ERBB2 (MS-730-PABX; Thermo
Scientific, Fremont, CA) and RPS6KB1 (2708S; Cell Signaling, Danvers,
MA). Human β-actin antibody (Sigma, St. Louis, MO) was used as a
loading control.

Knockdown  Assays 

Short hairpin RNAs (shRNAs; Table W1) were transduced in the
presence of 1 μg/ml polybrene. All siRNA transfections were performed
using Oligofectamine reagent (Life Sciences). For siRNA knockdown
experiments, multiple custom siRNA sequences targeting the
ARID1A-MAST2 fusion (Thermo, Lafayette, CO) were used [12].

Results 

Paired-end transcriptome sequencing of breast cancer cell lines and
tissues led to the identification of an average of more than four gene
fusions per breast cancer sample [12]. Interestingly, we observed that
some of the cell lines with the largest number of gene fusions also
harbored many well-known chromosomal amplifications, prompting
us to examine a likely association between genomic amplifications
and gene fusions. To assess copy number alterations at the
chromosomal coordinates of the fusion genes, we analyzed aCGH (244K
Agilent array) data in a set of 14 cell lines (Table W2) and observed
that as many as 62% of the total number of fusions were associated
with regions of amplifications (Figure 1A). The genes involved in
fusions were found to be significantly associated with their genomic
amplification status based on the Fisher exact test (P < .0004) in four
of six cell lines with the maximum number of fusions, including
BT-474, MCF7, HCC2218, and UACC893 (Figure 1B).
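
The association test described above can be illustrated with a small, self-contained implementation of the two-sided Fisher exact test on a 2 × 2 table (genes in fusions versus other genes, amplified versus unamplified). This is a sketch rather than the analysis code used in the study: the function name and the table layout are assumptions, and no counts from the paper are reproduced here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows: genes in fusions / genes not in fusions.
    Columns: amplified / not amplified.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that x of the fusion genes are amplified.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
```

For a given cell line, `a` would be the number of fusion genes in amplified regions, `b` the fusion genes in unamplified regions, and `c` and `d` the corresponding counts for genes not involved in fusions.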

Examining the distribution of fusion genes in individual samples
revealed that a majority of the gene fusions were associated with the
17q12 amplicon harboring ERBB2; the 17q23 amplicon that includes
genes such as BCAS3, RPS6KB1, and TMEM49; the 20q13 amplicon with
BCAS4; and the chr8q amplicon commonly found amplified in breast
cancer (Table W2 and Figures 2 and W1). Interestingly, the breast
cancer cell line BT-474, which harbors both the chr17 amplicons and
the chr20 amplicon, and MCF7, with prominent amplifications in
chr17, chr20, and chr8q, showed the maximum number of gene fusions
observed in a sample, together accounting for as many as 26 gene
fusions associated with amplicons compared against only 9 in
unamplified loci (Figures 1 and 2 and Table W2).

Figure 1. Distribution of gene fusions across breast cancer cell lines. (A) Pie chart representation of the relative proportion of gene fusions
associated with loci of genomic amplifications compared to unamplified loci (left) and bar graph representation of the relative distribution
of gene fusions across different breast cancer cell lines (right). (B) Table summarizing the statistical significance of association between
gene fusions and chromosomal amplifications in breast cancer cell lines with the highest number of gene fusions in A (using the Fisher exact
test, sorted by P value).
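
The classification of a fusion as amplicon-associated can be sketched as a simple interval-overlap check between the genomic coordinates of the fusion partners and the amplified aCGH segments. This is a minimal sketch under stated assumptions: the names `overlaps`, `is_amplicon_associated`, and `amplicon_fraction` are hypothetical, the rule that one amplified partner is sufficient is an assumption, and the coordinates below are invented placeholders rather than real gene positions.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals lie on the same chromosome and share a base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def is_amplicon_associated(partners: List[Interval],
                           amplified: List[Interval]) -> bool:
    """Assumed rule: a fusion counts as amplicon-associated if at least
    one of its partner genes overlaps an amplified segment."""
    return any(overlaps(gene, seg) for gene in partners for seg in amplified)


def amplicon_fraction(fusions: Dict[str, List[Interval]],
                      amplified: List[Interval]) -> float:
    """Fraction of fusions with at least one partner in an amplified segment."""
    if not fusions:
        return 0.0
    hits = sum(is_amplicon_associated(p, amplified) for p in fusions.values())
    return hits / len(fusions)


# Invented placeholder coordinates, for illustration only.
amplified_segments = [("chr17", 1_000, 2_000)]
example_fusions = {
    "GENE_A-GENE_B": [("chr17", 1_500, 1_600), ("chr1", 10, 20)],
    "GENE_C-GENE_D": [("chr2", 100, 200), ("chr3", 100, 200)],
}
print(amplicon_fraction(example_fusions, amplified_segments))
```

For the counts quoted in the text for BT-474 and MCF7 (26 amplicon-associated fusions against 9 in unamplified loci), the corresponding fraction is 26/35, or about 74%.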


In the backdrop of a large number of somatic aberrations seen in
cancers, any “recurrent” events observed across samples are generally
regarded as potentially “driving” tumorigenesis. Interestingly, among
the more than 380 gene fusions reported in our compendium of breast
cancer fusions [12], as many as 62 genes were found to be recurrent
partners (appearing at least twice). Among these, whereas the MAST
and Notch fusions were shown to be functionally recurrent and
potentially driving aberrations in up to 5% to 7% of breast cancers,
33 of the other recurrent gene fusions were found to be associated
with known frequent amplicons, including ERBB2, BCAS3/4, and chr8q.
Among these, three fusions each involved the ikaros family zinc finger
protein 3 transcription factor (IKZF3 on the chr17q12 amplicon) and
breast carcinoma amplified sequence 3 (BCAS3 on the chr17q23
amplicon) as 3' partners, all with different 5' partners. Similarly,
tripartite motif containing 37 (TRIM37 on chr17q23) was a common 5'
partner in three distinct gene fusions with different 3' partners
(Table W2). To further expand our integrative analysis of copy number
aberrations and gene fusions, we next used the breast cancer aCGH
data [13,14] and observed gene fusion-associated amplicons in MCF7,
BT-474, MDA-MB-468, and HCC-1187, as seen in our data, as well as in
an additional panel of cell lines, including ZR-75-30, SUM190,
MDA-MB-361, HCC-1428, and HCC-1569 (Figure W2). Clearly, apart from
triggering overexpression of constituent genes, our observations
strongly suggest that the loci of chromosomal amplifications also serve
as “hotspots” for the generation of recurrent gene fusions.
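
The notion of a recurrent partner used above (a gene that appears in at least two distinct fusions) can be expressed as a simple tally. The sketch below is illustrative only: the function name `recurrent_partners` is hypothetical, and the input is a small subset of fusions named elsewhere in this report, not the full compendium.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def recurrent_partners(fusions: Iterable[Tuple[str, str]],
                       min_count: int = 2) -> Dict[str, int]:
    """Count how many distinct fusions each gene takes part in and keep
    the genes that appear at least `min_count` times.

    Each fusion is a (5' partner, 3' partner) pair.
    """
    counts: Counter = Counter()
    for five_prime, three_prime in set(fusions):  # de-duplicate fusions
        counts[five_prime] += 1
        counts[three_prime] += 1
    return {gene: n for gene, n in counts.items() if n >= min_count}


# Illustrative subset of fusions mentioned in the text.
fusions = [
    ("RPS6KB1", "SNF8"),
    ("RPS6KB1", "VMP1"),
    ("EGFR", "POLD1"),
    ("ARID1A", "MAST2"),
]
print(recurrent_partners(fusions))
```

A tally of this kind flags recurrence only; as the text argues, it cannot by itself distinguish a recurrent driver from a gene that recurs because it sits in a frequently amplified region.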

Next, to assess whether amplicon-associated gene fusions impart
oncogenic phenotypes on the cells, we examined the open reading
frames (ORFs), functional domains/motifs, and conservation of
fusion architecture across different samples. Among recurrent fusion
candidates within amplicons, we focused on known cancer-associated
partner genes such as kinases, oncogenes, tumor suppressors, or known
fusion partners in the Mitelman Database of chromosomal aberrations
in cancer [15] and observed several functionally plausible gene fusions.
Here we describe our observations with two specific examples of gene
fusions involving oncogenic kinases.

The triple-negative breast cancer cell line MDA-MB-468 is known
to show an overexpression of epidermal growth factor receptor (EGFR)
[16]. In our transcriptome sequencing compendium of 89 breast cancer
cell lines and tissues, the highest expression of EGFR is observed in
MDA-MB-468 (Figure 3A), potentially resulting from a focal
amplification at chr7p12 (Figure 2). In addition, we detected an EGFR
fusion transcript (EGFR-POLD1) in this cell line, encoding the
N-terminal portion of EGFR, completely devoid of the tyrosine kinase
domain (Figure 3A, inset). However, the uniform read coverage observed
across the full length of the EGFR transcript in this sample (Figure 3B)
precluded the existence of any exon imbalance, suggesting that even as
the kinase domain is lost in the fusion, the full-length EGFR protein is
expressed in this cell line. Further, we observed a remarkable mismatch
between the copy numbers of EGFR and its fusion partner POLD1
(Figure 3C) that supports a predominant expression of full-length
EGFR compared with the EGFR-POLD1 chimera. This is unlike the
observation in the case of MAST kinase fusions in breast cancer
characterized in our previous study [12], in which a marked exon
imbalance in coverage was observed (Figure W3). Considering that
MDA-MB-468 harbors both MAST2 and EGFR fusions, we were
intrigued to assess its relative “dependence” on both the kinases.
Surprisingly, a profound reduction in cell proliferation was observed on
siRNA knockdown of MAST2, whereas EGFR knockdown showed little effect
(Figure 3D). Next, testing the possibility of the EGFR amplicon
potentially cooperating with MAST2, we found that the effect of combined
knockdown of EGFR and MAST2 was comparable with that of
MAST2 knockdown alone (Figure 3D), further suggesting that EGFR
amplification does not signify a driver aberration. In this context, the
EGFR fusion transcript that represents a minuscule fraction of overall
EGFR expression and encodes only the N-terminal portion lacking the
kinase domain was reckoned to be inconsequential.
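
The exon-imbalance reasoning above can be made concrete by comparing mean read coverage on either side of the fusion breakpoint. The following is a minimal sketch, assuming per-exon (or per-base) normalized coverage values ordered 5' to 3' along the transcript; the function name `exon_imbalance_ratio` and the toy coverage values are hypothetical and are not taken from the data.

```python
from statistics import mean
from typing import Sequence


def exon_imbalance_ratio(coverage: Sequence[float],
                         breakpoint_index: int) -> float:
    """Ratio of mean coverage downstream of the breakpoint to mean
    coverage upstream of it.

    A ratio near 1 indicates uniform coverage (no exon imbalance). A
    ratio well below 1 would be expected when a 5' fusion partner adds
    reads only to the upstream exons, and a ratio well above 1 when a
    3' fusion partner adds reads only to the downstream exons.
    """
    upstream = coverage[:breakpoint_index]
    downstream = coverage[breakpoint_index:]
    if not upstream or not downstream:
        raise ValueError("breakpoint must split the transcript in two")
    upstream_mean = mean(upstream)
    if upstream_mean == 0:
        raise ValueError("upstream coverage is zero; ratio undefined")
    return mean(downstream) / upstream_mean


# Toy coverage profiles (hypothetical values).
uniform_profile = [50, 52, 48, 50, 51, 49]      # resembles no imbalance
imbalanced_profile = [5, 6, 5, 40, 42, 41]      # resembles a 3' partner
print(exon_imbalance_ratio(uniform_profile, 3))
print(exon_imbalance_ratio(imbalanced_profile, 3))
```

Any cutoff for calling an imbalance would have to be chosen empirically; none is stated in the text.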

Next, we looked at recurrent gene fusions involving the oncogenic
serine/threonine kinase ribosomal protein S6 kinase (RPS6KB1) on
chr17q23, which is frequently amplified in breast cancers [17-20];
these were identified in BT-474 (RPS6KB1-SNF8) and MCF7
(RPS6KB1-VMP1). Both of these cell lines harbor amplifications at the
RPS6KB1 locus and express the highest levels of RPS6KB1 among all the
samples examined (Figure 4A). Both the chimeric transcripts retain only
the first exon of RPS6KB1, and the respective open reading frames show
a complete loss of the kinase domain (Figure 4A, inset). We also
observed an even read coverage across the RPS6KB1 transcript in both
fusion-positive cell lines, similar to a representative benign mammary
epithelial cell line, albeit at a much higher level, indicating that
full-length RPS6KB1 protein is encoded in these samples (Figures 4B and
W4A). Further, the difference in copy number observed between the
fusion partners in both the RPS6KB1 fusions (Figures 4C and W4B)
indicates an allelic imbalance between the full-length and the putative
fusion genes. Next, considering that BT-474 is an ERBB2-positive cell
line, we tested potential dependence of these cells on the RPS6KB1
protein. Surprisingly, similar to our observations with EGFR knockdown
in MDA-MB-468 cells, here we observed only a small effect on cell
proliferation after shRNA knockdown of RPS6KB1, in dramatic contrast to
the effect of ERBB2 knockdown (Figure 4D). Notably, the shRNA knockdown
of RPS6KB1 led to a significant depletion of the full-length protein,
yet it did not affect cell proliferation compared with ERBB2 protein
depletion (Figure 4D, inset). Therefore, BT-474 cells do not display a
dependence on RPS6KB1 protein, and considering that the RPS6KB1 fusion
product is completely devoid of all functional domains of RPS6KB1,
including the kinase domain, this fusion also likely represents a
passenger event.

Figure 2. Graphical representation of integrative analysis of gene fusions with copy number analysis. Circos plots of the genome-wide
distribution of gene fusions along with status of copy number alterations. Red and green peaks represent amplifications and deletions;
purple and cyan lines represent the fusions associated with amplicons and nonamplicons, respectively; “n” refers to the total number of
fusions identified. Panels: BT-474 (n=17), HCC-2218 (n=9), MCF7 (n=18), and MDA-MB-468 (n=4).

Figure 3. (A) Normalized expression (RPKM) of EGFR in descending order of expression in a panel of breast cancer samples obtained
from RNA-Seq. Schematic representation of wild-type EGFR and POLD1 proteins with putative breakpoints indicated by red arrows and
the domain structure of the putative fusion protein (inset). (B) Plot of normalized coverage of EGFR transcript in MDA-MB-468 cell line
showing the location of the breakpoint (indicated by red arrow). (C) Bar graph representing the copy number of EGFR and POLD1 in
MDA-MB-468. (D) Proliferation assay showing absolute cell count (y axis) over a time course (x axis) after knockdown with EGFR and/or
MAST2 siRNAs in MDA-MB-468. QPCR assessment of knockdown efficiencies relative to nontargeted control (NTC; inset).

Discussion 

In our systematic search for gene fusions in breast cancer using
high-throughput transcriptome sequencing, we observed a notably large
number of fusion genes associated with many well-characterized recurrent
amplicons, including 17q12, 17q23, 20q13, and 8q, among others.
Amplicon-associated gene fusions were found to involve complex and
cryptic rearrangements, involving one or both partners within the
amplicon site, with the chimeric transcript expression apparently
concealed in the backdrop of highly expressed wild-type genes. The gene
fusions considered here include only “expressed” chimeric transcripts
derived from known/annotated fusion partners. Chromosomal rearrangements
that do not express chimeric transcripts or that involve unannotated
fusion partners are excluded from this analysis. This likely accounts
for the variability observed in the number of gene fusions scored across
multiple samples with known amplicons. Because many of the fusions
at the amplicons appeared to be recurrent, although frequently fused
with multiple different partners, it led us to examine whether the
recurrence was incidentally associated with recurrent amplicons or
signified functionally important aberrations.

MDA-MB-468 represents a prototype triple-negative breast cancer
cell line with a “basal-like” gene expression profile that shows an
overexpression of the oncogenic kinase EGFR due to a focal
amplification at chr7p12. Here we discovered a chimeric transcript
involving EGFR. However, careful examination of this transcript revealed
that the fusion encodes N-terminal EGFR protein, without the kinase
domain. Transcriptome sequencing did not show evidence of
fusion-associated exon imbalance in EGFR expression, suggesting that
full-length EGFR is expressed in this cell line. In addition, the
significantly higher genomic copy number of EGFR compared to its fusion
partner POLD1 suggests that a minor allelic fraction of EGFR is involved
in fusion with POLD1, whereas other amplified copies of the gene
express the full-length molecule. Technically, the detection and
monitoring of the EGFR fusion transcript in the backdrop of extremely
high levels of wild-type EGFR transcript is challenging; therefore,
we chose to assess the dependency imparted by full-length EGFR.
Interestingly, the knockdown of EGFR had only a slight effect on
the proliferation of MDA-MB-468 cells, whereas a profound reduction
in cell proliferation was observed on the knockdown of the fusion gene
MAST2. Combined knockdown of MAST2 and EGFR produced the
same effect as that by MAST2 alone, further calling into question the
credentials of EGFR as a driver aberration in MDA-MB-468 cells.
Interestingly, MDA-MB-468 is known to be insensitive to EGFR
inhibitors like erlotinib [21] and gefitinib [22].

Figure 4. (A) Normalized expression (RPKM) of RPS6KB1 in descending order of expression in a panel of breast cancer samples obtained
from RNA-Seq. Schematic representation of wild-type RPS6KB1, TMEM49, and SNF8 proteins with putative breakpoints indicated by red
arrows and the domain structure of the putative fusion proteins in BT-474 and MCF7 (inset). (B) Plot of normalized coverage of RPS6KB1
transcript in BT-474 cell line showing the location of the breakpoint (indicated by red arrow). (C) Bar graph representing the copy number
of RPS6KB1 and SNF8 in BT-474. (D) Proliferation assay showing absolute cell count (y axis) over a time course (x axis) after knockdown
with RPS6KB1 and/or ERBB2 shRNAs in BT-474. Western blot assessment of the knockdown efficiency relative to nontargeted control
(NTC). Actin was used as a loading control (inset).

Similarly, the recurrent gene fusions involving RPS6KB1 retain
only the first exon, and the chimeric ORFs show a complete loss of
the kinase domain in breast cancer cell lines BT-474 and MCF7.
Similar to the EGFR fusion, DNA copy number analysis and RNA-Seq
data provided evidence that full-length RPS6KB1 protein is
encoded in both these cell lines. Notably, both BT-474 and MCF7 have
been shown to express high levels of full-length RPS6KB1 protein [23],
suggesting that these cells exhibit elevated activity of RPS6KB1 as
a result of amplification, independent of the fusion. Again, similar
to EGFR knockdown in MDA-MB-468, RPS6KB1 knockdown in
BT-474 (an ERBB2-positive cell line) showed an insignificant effect
on cell proliferation compared to ERBB2 knockdown. Interestingly,
in a previous study, knockdown of RPS6KB1 was found to have no
effect on apoptosis in either BT-474 or MCF7 breast cancer cells [24].

In the light of our observations, we surmise that repeated breaks
and rejoining of chromosomes during chromosomal amplifications
led to the generation of amplicon-associated gene fusions. Loci of
recurrent genomic amplifications thus engender “pseudo” recurrent gene
fusions that may largely represent passenger aberrations involving
random breakpoints. The two cell lines with established drivers (ERBB2
in BT-474 and MAST2 in MDA-MB-468) made it possible for us to
assess the relative importance of amplicon fusions involving RPS6KB1
and EGFR, respectively. In cases where a driver is not clearly apparent,
a more careful examination of all plausible fusion candidates will be
required. Importantly, even as our study primarily pertains to breast
cancers based on available data and a well-documented preponderance
of copy number aberrations in breast cancers [10], we expect the
association between amplicons and gene fusions to be consistent in other
cancers as well. We argue here for a measure of caution in considering
the functional implications of recurrent gene fusions associated with
amplifications because these may be simply a result of massive
chromosomal upheaval at the amplicons, not representing clonally
selected oncogenic events.

Table W1. Primer Sequences and siRNA/shRNA Clone Details.

Gene Symbol    Clone ID
EGFR           LU-003114-00-0002
ERBB2          SHCLNV-NM_004448
RPS6KB1        SHCLNV-NM_003161

Primer         Sequence
EGFR-f1        GGGCCAGGTCTTGAAGGCTGT
EGFR-r1        ATCCCCAGGGCCACCACCAG
EGFR-f2        ACACCCTGGTCTGGAAGTACGCA
EGFR-r2        AGTGGGAGACTAAAGTCAGACAGTGAA
EGFR-f3        CCGAGGCAGGGAATGCGTGG
EGFR-r3        TGGCCTGAGGCAGGCACTCT
ERBB2-f1       TGCGCAGGCAGTGATGAGAGT
ERBB2-r1       TCTCGGGACTGGCAGGGAGC
ERBB2-f2       TCCTCCTCGCCCTCTTGCCC
ERBB2-r2       TCTCGGGACTGGCAGGGAGC
RPS6KB1-f1     TGCTGACTGGAGCACCCCCA
RPS6KB1-r1     GCTTCTTGTGTGAGGTAGGGAGGC
GAPDH-f1       GGCTGAGAACGGGAAGCTTGTCA
GAPDH-r1       TCTCCATGGTGGTGAAGACGCCA
MAST2_f1       GAAGTGAGTGAGGATGGCTGCCTT
MAST2_r1       GAGCCGCTCCATGCTGCTGTAC
MAST2_f2       ATTGAGGGCCATGGGGCATCT
MAST2_r2       CCCCATAGGCGCCATTGCTGATG

Table W2. List of Gene Fusions Identified in 14 Breast Cancer Cell Lines, along with Their Copy Number Status.

Figure W1. UCSC tracks (hg18, RefSeq Genes) displaying the ERBB2 amplicon (17q12) and the RPS6KB1 amplicon (17q22-23), with fusion genes highlighted in yellow.

Figure W2. Graphical representation of integrative analysis of gene fusions with copy number analysis. Circos plots of the genome-wide
distribution of gene fusions along with status of copy number alterations. Red and green peaks represent amplifications and deletions; purple
and cyan lines represent the fusions associated with amplicons and nonamplicons, respectively; “n” refers to the total number of fusions identified.

Figure W3. Plot of normalized coverage of MAST1 and MAST2 transcripts in MAST fusion-positive samples (breakpoint indicated by arrow).

Figure W4. (A) Plot of normalized coverage of RPS6KB1 transcript in BT-474, MCF7, and H16N2 cell lines. (B) Bar graph representing the
copy number of RPS6KB1 and TMEM49 in MCF7.


